Tcpdump for Everyone: Proposal to add pcap-release to cf-deployment #980

@domdom82

What is this issue about?

At SAP BTP networking and routing, we regularly face problems such as these:

  • "my CF app can't connect to xxx"
  • "I can't connect to my CF app"
  • "Why is my app so slow?"
  • "I get strange timeouts trying to connect to the platform"

Operators, on the other hand, get complaints like these:

  • "The customer gets a TLS handshake error with the platform and claims it's our fault. How can I see what's going on during that handshake?"
  • "We get connectivity issues between gorouter and the backend app. How do we debug them?"

If logs are not helping, the issue is usually resolved by helping the customer or the operator run a tcpdump of their application and analyzing the pcap files in Wireshark.

Of course, this means a lot of work for operations and development, but what if the users themselves were able to capture their app's traffic?

Enter Pcap-Release

We have started working on a solution that allows regular CF users as well as BOSH operators to inspect the network traffic of their apps. The system is composed of three parts:

  • Pcap-Agent: A BOSH job running on every Diego Cell (to capture inside CF app containers) as well as on BOSH VMs. It can enter the network namespace of a CF app container and tap into its network devices using libpcap and BPF filters, just like tcpdump does. It leverages gopacket, a Go pcap library by Google.
  • Pcap-API: A BOSH job providing a public end-user API that clients can talk to. It is responsible for authentication (via UAA) and authorization (via Cloud Controller and the BOSH director). It connects to the Pcap-Agent running on the target VM, where the user's application or BOSH instance is running, and streams the pcap data back to the user.
  • Pcap-CLI: A CF CLI plugin that provides a convenient way of connecting to Pcap-API. It lets the user specify which app they want to tap, which instance(s) of the app, what BPF filter should be applied, and on which network interface they want to capture traffic. There is also a separate CLI for the BOSH case.
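
To make the BPF filter mentioned above more concrete, here is a minimal sketch of how a client could compose a filter expression in the pcap-filter syntax that tcpdump and libpcap accept. The helper `build_bpf_filter` and its parameters are hypothetical and are not part of Pcap-CLI; only the resulting expression syntax (`tcp`, `host`, `port`, `and`) is standard.

```python
from typing import Optional


def build_bpf_filter(host: Optional[str] = None,
                     port: Optional[int] = None,
                     proto: Optional[str] = None) -> str:
    """Compose a pcap-filter expression from optional parts.

    Hypothetical helper for illustration only; Pcap-CLI may accept the
    filter string directly instead.
    """
    parts = []
    if proto is not None:
        if proto not in ("tcp", "udp", "icmp"):
            raise ValueError(f"unsupported protocol: {proto}")
        parts.append(proto)
    if host is not None:
        parts.append(f"host {host}")
    if port is not None:
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
        parts.append(f"port {port}")
    # An empty expression places no restriction on the captured packets.
    return " and ".join(parts)


if __name__ == "__main__":
    print(build_bpf_filter(host="10.0.0.5", port=8080, proto="tcp"))
```

The resulting string is the same kind of expression one would pass to tcpdump on the command line, so filters tested locally with tcpdump should carry over.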

The project was originally hosted under my org as pcap-server-release. It has since moved to a permanent location: cloudfoundry/pcap-release

The repository provides an ops-file that integrates the pcap-release with Diego as well as an example manifest that deploys the pcap-API onto its own VMs.

Architecture

[Figure: full architecture diagram]

Explanation: Stream to User

  1. Pcap-API uses route-registrar to publish a route that is called by CF CLI
  2. CF CLI logs into UAA, selects org/space, receives access token
  3. CF CLI selects app to capture and sends it alongside access token to Pcap-API
  4. Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
  5. Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
  6. Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
  7. Pcap-API collects each pcap stream from Diego Cells and multiplexes them into one single stream
  8. Pcap-API returns the pcap stream back to the end user
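
Step 7, combining the per-cell streams into one, can be sketched as follows. This is an illustration under stated assumptions, not the Pcap-API implementation: it assumes each agent delivers packets in capture order and that the combined stream should be ordered by timestamp. The names `Packet`, `merge_streams`, and `to_pcap` are hypothetical. The header layout follows the classic pcap file format (24-byte global header, 16-byte per-record header).

```python
import heapq
import struct
from typing import Iterable, Iterator, NamedTuple


class Packet(NamedTuple):
    ts_sec: int    # capture time, seconds since the epoch
    ts_usec: int   # microseconds part of the capture time
    data: bytes    # raw frame bytes as captured


def merge_streams(*streams: Iterable[Packet]) -> Iterator[Packet]:
    """Merge per-instance streams (each already time-ordered) into one."""
    return heapq.merge(*streams, key=lambda p: (p.ts_sec, p.ts_usec))


def pcap_global_header(snaplen: int = 65535, linktype: int = 1) -> bytes:
    # magic, version 2.4, thiszone, sigfigs, snaplen, link type (1 = Ethernet)
    return struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, snaplen, linktype)


def pcap_record(pkt: Packet) -> bytes:
    # ts_sec, ts_usec, captured length, original length, then the frame bytes
    header = struct.pack("<IIII", pkt.ts_sec, pkt.ts_usec,
                         len(pkt.data), len(pkt.data))
    return header + pkt.data


def to_pcap(packets: Iterable[Packet]) -> bytes:
    """Serialize packets into a single pcap byte stream."""
    return pcap_global_header() + b"".join(pcap_record(p) for p in packets)


if __name__ == "__main__":
    cell_a = [Packet(1, 0, b"a1"), Packet(3, 0, b"a2")]
    cell_b = [Packet(2, 500, b"b1")]
    blob = to_pcap(merge_streams(cell_a, cell_b))
    print(len(blob), "bytes of pcap data")
```

Because `heapq.merge` is lazy, the merge never needs a whole stream in memory, although `to_pcap` as written does buffer its output. A real service would write each record out as it arrives.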

Explanation: Stream to Storage / Download later

This is needed if the captured traffic is too much for the end user to handle as a live stream. The traffic is instead streamed to an object store (such as AWS S3) and tagged with the user's ID.

  1. Pcap-API uses route-registrar to publish a route that is called by CF CLI
  2. CF CLI logs into UAA, selects org/space, receives access token
  3. CF CLI selects app to capture and sends it alongside access token to Pcap-API
  4. Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
  5. Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
  6. Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
  7. Pcap-API collects each pcap stream from Diego Cells and multiplexes them into one single stream
  8. Pcap-API uploads the pcap stream to object store and tags it with the user's id
  9. Pcap-API provides a download URL to the end user
  10. User downloads pcap file using download URL
  11. Object store removes pcap file automatically after a retention period
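
The storage-specific steps (upload tagged with the user's ID, later download, automatic removal after a retention period) can be sketched with an in-memory stand-in for the object store. Everything here is hypothetical: the class `InMemoryObjectStore`, the key layout, and the owner check on download are assumptions for illustration. A real deployment would more likely rely on the object store's own lifecycle rules and signed download URLs.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StoredCapture:
    user_id: str       # tag identifying who requested the capture
    data: bytes        # the pcap file contents
    expires_at: float  # epoch seconds after which the object is removed


class InMemoryObjectStore:
    """Stand-in for an object store such as AWS S3 (illustration only)."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._objects: Dict[str, StoredCapture] = {}

    def upload(self, user_id: str, data: bytes,
               now: Optional[float] = None) -> str:
        """Store a capture and return its key (standing in for a download URL)."""
        now = time.time() if now is None else now
        key = f"captures/{user_id}/{uuid.uuid4().hex}.pcap"
        self._objects[key] = StoredCapture(
            user_id, data, now + self.retention_seconds)
        return key

    def expire(self, now: Optional[float] = None) -> None:
        """Remove every object whose retention period has elapsed."""
        now = time.time() if now is None else now
        expired = [k for k, o in self._objects.items() if o.expires_at <= now]
        for key in expired:
            del self._objects[key]

    def download(self, user_id: str, key: str,
                 now: Optional[float] = None) -> bytes:
        """Return the capture if it still exists and belongs to the caller."""
        self.expire(now)
        obj = self._objects.get(key)
        if obj is None:
            raise KeyError(key)
        if obj.user_id != user_id:
            raise PermissionError("capture belongs to a different user")
        return obj.data
```

The `now` parameter exists only so that the retention behaviour can be exercised without waiting. With a 60-second retention, a capture uploaded at `now=1000.0` is downloadable at `now=1030.0` and gone at `now=1061.0`.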

Current status / next steps of the project

The project is considered pre-alpha. Basic use cases are working, and some authentication and authorization is in place. The Pcap-API URL is registered using route-registrar. The connection between Pcap-API and Pcap-Agent is secured using mTLS.

Use Cases Complete / Missing:

  • Capture single CF instance, stream to user
  • Capture multiple CF instances, stream to user
  • Capture single CF instance, upload to storage API for later download
  • Capture multiple CF instances, upload to storage API for later download
  • Capture single BOSH instance, stream to user
  • Capture multiple BOSH instances, stream to user
  • Capture single BOSH instance, upload to storage API for later download
  • Capture multiple BOSH instances, upload to storage API for later download

Next steps will likely be:

  • Implement BOSH streaming use cases
  • Add more tests and documentation
  • Refactor the code: add better error handling, and add middleware for authentication and other concerns such as rate limiting
  • POC of non-streaming use cases using AWS S3

The goal of this issue

We recently showed a demo of the release to the App Runtime Platform WG audience. It was well received, and it was suggested that we bring it to cf-deployment to discuss options for integrating it.

We would like to use this issue to answer the following questions:

  • Is there enough interest in the community to make this feature part of cf-deployment?
  • If yes, what would be the best place to put it within cf-deployment?
  • If yes, can we consider moving the home repo under the cloudfoundry or cloudfoundry-incubator org?

Feel free to reach out to me on the CF-Community Slack as well! My handle is @domdom82.
