Feature request: allow Prometheus to scrape log-cache directly #96
Description
Hello there,
I’m an operator of a multi-tenant OSS CF for the UK Government.
We provide a mechanism for our tenants to extract their app metrics and store them in Prometheus. To do this, we've had to build a few moving parts. One significant piece is a /metrics endpoint that sits in front of log-cache and exposes the metrics visible to the user (based on the Authorization header) in a format that standard Prometheus scraping can ingest.
Suggested feature
We understand from this project's README and some #logcache Slack chat that a future aim is to provide endpoints which would enable existing PromQL API clients to talk to log-cache natively. We wonder if there would be any interest in also providing a /metrics endpoint such that Prometheus itself could scrape log-cache directly - effectively providing a prometheus exporter for log-cache?
This would bring several advantages for consumers, allowing:
- teams to use Prometheus's Alertmanager and other non-PromQL ecosystem components
- application teams to decouple their stats from the platform
- operators to use their existing Prometheus servers, which can be persistent and durable in a way that we don't believe log-cache is currently designed (or aiming) to be.
We feel the /metrics contract is a good one for log-cache to expose to the Prometheus universe, as log-cache already uses the concept of "you'll see all the stats for which your API token gives you visibility" via the Authorization header.
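To make the contract concrete, here is a minimal sketch of what a /metrics response body looks like in the Prometheus text exposition format. The sample shape (name, labels, value), the label names and the choice to report everything as a gauge are assumptions made for illustration; none of this is log-cache's actual data model or API.

```python
def escape_label_value(value: str) -> str:
    # The text format requires backslash, double quote and newline
    # to be escaped inside label values.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text format.

    Assumption: every metric is exposed as a gauge. A real exporter
    would need to distinguish counters from gauges.
    """
    # All series for one metric name must appear together, so group first.
    grouped = {}
    for name, labels, value in samples:
        grouped.setdefault(name, []).append((labels, value))

    lines = []
    for name, series in grouped.items():
        lines.append(f"# TYPE {name} gauge")
        for labels, value in series:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            selector = f"{{{pairs}}}" if pairs else ""
            lines.append(f"{name}{selector} {value}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical samples that one token might be allowed to see.
    print(render_metrics([
        ("cpu", {"app_guid": "abc", "instance": "0"}, 0.5),
        ("memory", {"app_guid": "abc", "instance": "0"}, 1024.0),
        ("cpu", {"app_guid": "abc", "instance": "1"}, 0.25),
    ]), end="")
```

In this sketch the set of samples passed in would already be restricted to what the caller's token can see, so the endpoint itself needs no filter parameters.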
Prometheus also contains the concept of a /federate API, which is similar to /metrics. /federate requires (or, at least, strongly suggests) that a consumer feeds the endpoint with a filter for stats it would like to see.
We think that /federate is probably not a great fit for log-cache's use case, where the set of metrics that a single OAuth token can access is implicit within the system. Providing a secondary restriction over the top of that set seems to run counter to log-cache's existing approach: just exposing all the stats which are available to the requestor.
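The difference between the two request shapes can be sketched as follows. This only builds the requests and does not send them. The base URL, the token value and the "bearer" header scheme are placeholders, and neither endpoint exists on log-cache today; `match[]` is the selector parameter that Prometheus's own /federate endpoint takes.

```python
from urllib.parse import urlencode
from urllib.request import Request


def metrics_request(base_url: str, token: str) -> Request:
    # /metrics style: no filter. The visible set of series is implied
    # entirely by the token in the Authorization header.
    return Request(
        f"{base_url}/metrics",
        headers={"Authorization": f"bearer {token}"},
    )


def federate_request(base_url: str, token: str, selectors) -> Request:
    # /federate style: the caller also supplies one or more match[]
    # selectors, which act as a second filter on top of the token.
    query = urlencode([("match[]", selector) for selector in selectors])
    return Request(
        f"{base_url}/federate?{query}",
        headers={"Authorization": f"bearer {token}"},
    )


if __name__ == "__main__":
    base = "https://log-cache.example.com"  # hypothetical host
    print(metrics_request(base, "TOKEN").full_url)
    print(federate_request(base, "TOKEN", ['{__name__=~"cpu|memory"}']).full_url)
```

With /metrics the scrape configuration carries only credentials. With /federate every consumer would also have to maintain selectors for a set of metrics that the token already defines.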
Potential difficulties
Prometheus expects some metrics to reset (or be removed) when their last value becomes stale. If this isn't done properly we have observed issues like:
- metrics aren't removed when apps are deleted
- metrics for a given cell aren't removed when the app migrates to another cell
- metrics for an instance aren't removed when the app scales down
This may be solvable only by reading logs from Doppler, or may require log-cache to reach out to other parts of the system (which may be undesirable).
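A cheaper heuristic than either of those options is time-based expiry: drop any series that has not received a fresh value within some window. The sketch below illustrates that idea only. The class name, the TTL value and the label shape are hypothetical, and it is not how log-cache works, nor necessarily the right answer to the problems listed above (a quiet but healthy app would also expire).

```python
import time


class ExpiringSeriesStore:
    """Keeps the latest value per series and forgets stale series."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._series = {}    # (name, sorted label pairs) -> (value, last_seen)

    def observe(self, name: str, labels: dict, value: float) -> None:
        key = (name, tuple(sorted(labels.items())))
        self._series[key] = (value, self._clock())

    def collect(self):
        """Return live series as (name, labels, value), pruning stale ones."""
        now = self._clock()
        stale = [
            key for key, (_, last_seen) in self._series.items()
            if now - last_seen > self._ttl
        ]
        for key in stale:
            del self._series[key]
        return [
            (name, dict(label_pairs), value)
            for (name, label_pairs), (value, _) in self._series.items()
        ]


if __name__ == "__main__":
    store = ExpiringSeriesStore(ttl_seconds=60)
    store.observe("cpu", {"app_guid": "abc", "instance": "0"}, 0.5)
    print(store.collect())
```

Under this approach an instance removed by a scale-down, or an app moved to another cell, stops appearing in scrapes once its TTL elapses, at the cost of a delay of up to one TTL.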
Next steps
If adding a /metrics endpoint aligns with your plans for log-cache we (GOV.UK PaaS) would be happy to contribute design and code as required. In the short term it's likely we'll implement something similar in spirit ourselves, as we already have live tenants using Prometheus via the projects mentioned below.
If this is not something that is likely to be added to log-cache then we may alter the design for our own metrics solutions to make them a more long-term part of our platform.
References
- https://github.com/alphagov/paas-metric-exporter - an internal project (useful for context only) which exports metrics from Doppler to StatsD and Prometheus.
- https://github.com/alphagov/paas-log-cache-adapter - an internal project (useful for context only) which exports metrics from log-cache to Prometheus.
- https://cloudfoundry.slack.com/archives/CBFB7NP9B/p1540392230000100 - #logcache Slack chat
- https://prometheus.io/docs/instrumenting/writing_exporters/ - writing prometheus exporters
- govuk-paas/paas-metric-exporter#33 ("Explicitly expire prometheus metrics when apps are deleted") - an example of non-obvious difficulties we've found while working in this problem space