What is this issue about?
At SAP BTP networking and routing, we regularly face problems such as these:
- "my CF app can't connect to xxx"
- "I can't connect to my CF app"
- "Why is my app so slow?"
- "I get strange timeouts trying to connect to the platform"
Operators, on the other hand, get complaints like these:
- "the customer gets a TLS handshake error with the platform and claims it's our fault. how can I see what's going on during that handshake?"
- "we get connectivity issues between gorouter and the backend app. how do we debug them?"
If logs are not helping, the issue is usually resolved by helping the customer or the operator run a tcpdump of their application and analyzing the pcap files in Wireshark.
Of course, this means a lot of work for operations and development, but what if the users themselves were able to capture their app's traffic?
Enter Pcap-Release
We have started working on a solution that allows regular CF users as well as BOSH operators to inspect the network traffic of their apps. The system is composed of three parts:
- Pcap-Agent: A BOSH job running on every Diego Cell inside CF app containers as well as BOSH VMs. It can enter the network namespace of a CF app container and tap into its network devices using libpcap and BPF filters, just like tcpdump does. It leverages gopacket, a Go pcap library by Google.
- Pcap-API: A BOSH job providing a public end-user API that clients can talk to. It is responsible for authentication (via UAA) and authorization (via Cloud Controller and BOSH director). It connects to the Pcap-Agent running on the target VM where the user's application or BOSH instance is running and streams the pcap data back to the user.
- Pcap-CLI: A CF CLI plugin that provides a convenient way of connecting to Pcap-API. Allows the user to specify which app they want to tap, which instance(s) of the app, what BPF filter should be applied and on which network interface they want to capture traffic. There is also a separate CLI for the BOSH case.
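To make the CLI options above concrete, here is a minimal sketch of what a capture request from Pcap-CLI to Pcap-API could look like. The type name `CaptureRequest`, its field names, and the JSON layout are assumptions made for illustration, not the actual pcap-release API.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// CaptureRequest is a hypothetical payload; the real Pcap-API schema may differ.
type CaptureRequest struct {
	AppID     string `json:"app_id"`    // GUID of the CF app to tap
	Instances []int  `json:"instances"` // app instance indexes; empty means all instances
	Device    string `json:"device"`    // network interface to capture on
	Filter    string `json:"filter"`    // BPF filter expression, as accepted by tcpdump
}

// Validate performs basic sanity checks before the request is sent.
func (r CaptureRequest) Validate() error {
	if r.AppID == "" {
		return errors.New("app id must not be empty")
	}
	if r.Device == "" {
		return errors.New("device must not be empty")
	}
	for _, i := range r.Instances {
		if i < 0 {
			return fmt.Errorf("invalid instance index %d", i)
		}
	}
	return nil
}

func main() {
	req := CaptureRequest{
		AppID:     "example-app-guid", // placeholder value
		Instances: []int{0, 2},
		Device:    "eth0",
		Filter:    "tcp port 8080",
	}
	if err := req.Validate(); err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	body, err := json.Marshal(req)
	if err != nil {
		fmt.Println("encoding failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

Validating on the client side keeps obviously broken requests away from the API, although the API would still need to run its own checks.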
The project was originally hosted under my org as pcap-server-release and has since moved to a permanent location: cloudfoundry/pcap-release
The repository provides an ops-file that integrates the pcap-release with Diego as well as an example manifest that deploys the pcap-API onto its own VMs.
Architecture
Explanation: Stream to User
- Pcap-API uses route-registrar to publish a route that is called by CF CLI
- CF CLI logs into UAA, selects org/space, receives access token
- CF CLI selects app to capture and sends it alongside access token to Pcap-API
- Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
- Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
- Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
- Pcap-API collects each pcap stream from Diego Cells and multiplexes them into one single stream
- Pcap-API returns the pcap stream back to the end user
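The step where Pcap-API combines the per-cell streams into one can be sketched as a fan-in over Go channels. This is only an illustration under assumptions: the `Packet` type, `mergeStreams`, and `agentStream` are hypothetical names, and the agents are simulated in-process rather than reached over the network.

```go
package main

import (
	"fmt"
	"sync"
)

// Packet is a hypothetical, simplified stand-in for a captured packet.
type Packet struct {
	Source string // which agent (Diego Cell) captured it
	Data   []byte
}

// mergeStreams fans in packets from several agent streams into one channel.
// The output channel is closed once every input stream has been closed.
func mergeStreams(streams ...<-chan Packet) <-chan Packet {
	out := make(chan Packet)
	var wg sync.WaitGroup
	for _, s := range streams {
		wg.Add(1)
		go func(s <-chan Packet) {
			defer wg.Done()
			for p := range s {
				out <- p
			}
		}(s)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// agentStream simulates one Pcap-Agent sending n packets and then finishing.
func agentStream(name string, n int) <-chan Packet {
	ch := make(chan Packet)
	go func() {
		defer close(ch)
		for i := 0; i < n; i++ {
			ch <- Packet{Source: name, Data: []byte{byte(i)}}
		}
	}()
	return ch
}

func main() {
	merged := mergeStreams(agentStream("cell-1", 2), agentStream("cell-2", 3))
	count := 0
	for p := range merged {
		count++
		fmt.Printf("packet from %s (%d bytes)\n", p.Source, len(p.Data))
	}
	fmt.Println("total packets:", count)
}
```

Packets from different cells arrive interleaved in no particular order here. A real implementation would probably also have to deal with timestamps and with agents that fail mid-capture.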
Explanation: Stream to Storage / Download later
This is needed if the capture produces more traffic than the end user's connection can handle. The traffic is instead streamed to an object store (like AWS S3) and tagged with the user's id.
- Pcap-API uses route-registrar to publish a route that is called by CF CLI
- CF CLI logs into UAA, selects org/space, receives access token
- CF CLI selects app to capture and sends it alongside access token to Pcap-API
- Pcap-API uses access token to check if user is logged in and can actually see the app to be tapped
- Pcap-API uses Cloud Controller to discover the location of the Diego Cell(s) that host the app
- Pcap-API connects to Pcap-Agents hosted on these cells and starts the capture
- Pcap-API collects each pcap stream from Diego Cells and multiplexes them into one single stream
- Pcap-API uploads the pcap stream to object store and tags it with the user's id
- Pcap-API provides a download URL to the end user
- User downloads pcap file using download URL
- Object store removes pcap file automatically after a retention period
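The upload, download, and retention steps above could look roughly like the following sketch. It uses an in-memory map as a stand-in for an object store such as AWS S3. All names (`memoryStore`, `Upload`, `Download`, `fakeCapture`), the key layout, and the 24-hour retention period are assumptions; a real object store would expire objects on its own, whereas this sketch only checks expiry on download.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// defaultRetention is an assumed value; the issue does not specify one.
const defaultRetention = 24 * time.Hour

// storedCapture is a hypothetical record for one uploaded pcap file.
type storedCapture struct {
	owner    string
	data     []byte
	uploaded time.Time
}

// memoryStore is an in-memory stand-in for an object store.
type memoryStore struct {
	retention time.Duration
	objects   map[string]storedCapture
}

func newMemoryStore(retention time.Duration) *memoryStore {
	return &memoryStore{retention: retention, objects: map[string]storedCapture{}}
}

// Upload reads the pcap stream to its end and stores it under a key
// that is tagged with the user's id.
func (s *memoryStore) Upload(userID, appID string, stream io.Reader) (string, error) {
	data, err := io.ReadAll(stream)
	if err != nil {
		return "", err
	}
	now := time.Now()
	key := fmt.Sprintf("%s/%s-%d.pcap", userID, appID, now.UnixNano())
	s.objects[key] = storedCapture{owner: userID, data: data, uploaded: now}
	return key, nil
}

// Download returns a capture only to its owner and only while the
// retention period, measured at time "at", has not passed.
func (s *memoryStore) Download(userID, key string, at time.Time) ([]byte, error) {
	obj, ok := s.objects[key]
	if !ok {
		return nil, errors.New("capture not found")
	}
	if obj.owner != userID {
		return nil, errors.New("capture belongs to another user")
	}
	if at.Sub(obj.uploaded) > s.retention {
		delete(s.objects, key) // emulate automatic removal
		return nil, errors.New("capture expired")
	}
	return obj.data, nil
}

// fakeCapture simulates the merged pcap stream produced by Pcap-API.
func fakeCapture(content string) io.Reader {
	return strings.NewReader(content)
}

func main() {
	store := newMemoryStore(defaultRetention)
	key, err := store.Upload("user-1", "app-1", fakeCapture("pcap bytes"))
	if err != nil {
		fmt.Println("upload failed:", err)
		return
	}
	fmt.Println("download key:", key)

	data, err := store.Download("user-1", key, time.Now())
	fmt.Println("bytes downloaded:", len(data), "error:", err)

	_, err = store.Download("user-1", key, time.Now().Add(2*defaultRetention))
	fmt.Println("after retention:", err)
}
```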
Current status / next steps of the project
The project is considered pre-alpha. Basic use cases are working, some authentication and authorization is done. Pcap-API URL is registered using route-registrar. Connection between Pcap-API and Pcap-Agent is secured using mTLS.
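As an illustration of the mTLS setup between Pcap-API and Pcap-Agent, the following sketch builds the two `tls.Config` values using Go's standard `crypto/tls` package. The function names and the server name are hypothetical, and the certificate and CA pool values are left empty; the actual pcap-release configuration may differ.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
)

// agentServerTLS builds a server-side config for a Pcap-Agent that only
// accepts clients presenting a certificate signed by one of clientCAs.
func agentServerTLS(certs []tls.Certificate, clientCAs *x509.CertPool) *tls.Config {
	return &tls.Config{
		Certificates: certs,
		ClientCAs:    clientCAs,
		ClientAuth:   tls.RequireAndVerifyClientCert, // this is what makes TLS mutual
		MinVersion:   tls.VersionTLS12,
	}
}

// apiClientTLS builds the client-side config for Pcap-API: it presents its
// own certificate and verifies the agent against agentCAs.
func apiClientTLS(certs []tls.Certificate, agentCAs *x509.CertPool, serverName string) *tls.Config {
	return &tls.Config{
		Certificates: certs,
		RootCAs:      agentCAs,
		ServerName:   serverName,
		MinVersion:   tls.VersionTLS12,
	}
}

func main() {
	// A real deployment would load key pairs with tls.LoadX509KeyPair and
	// fill the pools with AppendCertsFromPEM; empty values are used here.
	server := agentServerTLS(nil, x509.NewCertPool())
	client := apiClientTLS(nil, x509.NewCertPool(), "pcap-agent.example.internal")
	fmt.Println("agent client auth:", server.ClientAuth)
	fmt.Println("api expects server name:", client.ServerName)
}
```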
Use Cases Complete / Missing:
- Capture single CF instance, stream to user
- Capture multiple CF instances, stream to user
- Capture single CF instance, upload to storage API for later download
- Capture multiple CF instances, upload to storage API for later download
- Capture single BOSH instance, stream to user
- Capture multiple BOSH instances, stream to user
- Capture single BOSH instance, upload to storage API for later download
- Capture multiple BOSH instances, upload to storage API for later download
Next steps will likely be:
- Implement BOSH streaming use cases
- Add more tests and documentation
- Code refactoring. Add better error handling, add middleware for authentication and other benefits like rate-limiting
- POC of non-streaming use cases using AWS S3
The goal of this issue
We recently showed a demo of the release to the app runtime platform wg audience. It was well received, and it was suggested that we bring it to cf-deployment to discuss options for integrating it.
We would like to use this issue to answer the following questions:
- Is there enough interest in the community to make this feature part of cf-deployment?
- If yes, what would be the best place to put it within cf-deployment?
- If yes, can we consider moving the home repo under the cloudfoundry or cloudfoundry-incubator org?
Feel free to reach out to me on CF-Community Slack also! Handle is @domdom82