Skip to content

Emit block cache metrics#4518

Merged
DomGarguilo merged 8 commits into
apache:2.1from
DomGarguilo:blockCacheMetrics
May 7, 2024
Merged

Emit block cache metrics#4518
DomGarguilo merged 8 commits into
apache:2.1from
DomGarguilo:blockCacheMetrics

Conversation

@DomGarguilo
Copy link
Copy Markdown
Member

Fixes #4492

Adds cache hit and cache request count metrics for data, index and summary caches.

Here is an example when metrics are set up to be logged:

METRICS: 2024-05-02T14:46:46,628, accumulo.blockcache.hitcount.data{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=253
METRICS: 2024-05-02T14:46:46,628, accumulo.blockcache.hitcount.index{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=4
METRICS: 2024-05-02T14:46:46,628, accumulo.blockcache.hitcount.summary{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=0
METRICS: 2024-05-02T14:46:46,628, accumulo.blockcache.requestcount.data{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=287
METRICS: 2024-05-02T14:46:46,628, accumulo.blockcache.requestcount.index{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=32
METRICS: 2024-05-02T14:46:46,628, accumulo.blockcache.requestcount.summary{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=0

I am not sure if I named these that well so feedback on that would be helpful.

@EdColeman
Copy link
Copy Markdown
Contributor

This is a general observation - prompted by this PR, but is more general.

What it the purpose of the html javadocs table? From the description it is only for mapping old metrics names to new, but there are entries in there now that were not present in the old metrics system. So, it seems to have also become a place to document new metrics.

If the table is only for mapping changes, then this PR is correct to not add the new metrics to the table.

If the table has morphed into a way to document the available metrics so that they show up in documentation for user reference., then the new metric names and descriptions should be added to the table. Also, the general description of the purpose of the table should be changed to state that it reflects all metrics and includes the mapping from old to new where appropriate.

I do not know how we wish to document the metrics - the table is one way, but it is tedious for developers to keep in sync. The docs generated by the table do seem useful, I just don't know if this is the best way, or if we could find something that is easier to maintain. It does seem that we should provide documentation for the metric names.

We should also determine how "public" as in public API are the metric names. Similar to Properties, they are internal, so subject to change. But, they are also used externally and changes could cause issues with scripts, or in the case of metrics tracking and post-processing.

Currently I have been changing the metric names to be more consistent and hopefully follow a standard convention. Being that the 2.1 metrics are new and have not received a lot of scrutiny, changing non-public name seems like it would have minor impact now, but as the metrics are adopted more widely, changes, additions or deletions should at least be communicated - somehow.

@DomGarguilo DomGarguilo linked an issue May 2, 2024 that may be closed by this pull request
@keith-turner
Copy link
Copy Markdown
Contributor

Could move the cache type higher in the name. This brings the related data together when sorting names or using completion.

accumulo.blockcache.data.hitcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=253
accumulo.blockcache.data.requestcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=287
accumulo.blockcache.index.requestcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=32
accumulo.blockcache.index.hitcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=4
accumulo.blockcache.summary.requestcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=0
accumulo.blockcache.summary.hitcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} value=0

@keith-turner
Copy link
Copy Markdown
Contributor

keith-turner commented May 2, 2024

In addition to the tablet sever, also need to register these new metrics in the scan server. When registering in the scan server will need to pick up the resource group tag, which may happen automatically.

@DomGarguilo
Copy link
Copy Markdown
Member Author

When registering in the scan server will need to pick up the resource group tag, which may happen automatically.

I'm not sure I follow. Are you suggesting to add a Tag to the metrics that indicates the resource group name?

@EdColeman
Copy link
Copy Markdown
Contributor

PR #4461 adds the resource.group as a common tag - as long as you follow the same metrics init that you have in tserver the resource group should be automatically assigned.

# Conflicts:
#	server/tserver/src/main/java/org/apache/accumulo/tserver/ScanServer.java
Comment thread server/tserver/src/main/java/org/apache/accumulo/tserver/BlockCacheMetrics.java Outdated

@Override
public void registerMetrics(MeterRegistry registry) {
Gauge.builder("indexCacheHitCount", indexCache, cache -> cache.getStats().hitCount())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are monotonically increasing counters, then it would be better to use a FunctionCounter instead of a Gauge. Not sure if they are, need to investigate. The micrometer code to instrument caches uses FunctionCounter internally

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at the cache impls in Accumulo, these counters are monotonically increasing counters for those. We should probably add that to the SPI javadoc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would those also be candidates to be FunctionCounters? (maybe outside of this PR)

@DomGarguilo
Copy link
Copy Markdown
Member Author

After changing to the FunctionCounter, the way they are displayed (by default at least) is as follows when set up to log:

METRICS: 2024-05-06T16:26:53,938, accumulo.blockcache.data.hitcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} throughput=0.533333/s
METRICS: 2024-05-06T16:26:53,939, accumulo.blockcache.data.requestcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} throughput=0.6/s
METRICS: 2024-05-06T16:26:53,939, accumulo.blockcache.index.hitcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} throughput=1.2/s
METRICS: 2024-05-06T16:26:53,939, accumulo.blockcache.index.requestcount{host=localhost,instance.name=uno,port=9997,process.name=tserver} throughput=2.2/s

@EdColeman
Copy link
Copy Markdown
Contributor

I confirmed that the MetricsIT passes with the current changes.

@keith-turner
Copy link
Copy Markdown
Contributor

Wondering if the cache type should be a tag as opposed to being part of the metric name, but do not have a good sense of the pros and cons of that ATM.

@keith-turner
Copy link
Copy Markdown
Contributor

@EdColeman I saw your question, I think there are some other metrics in Accumulo that are monotonically increasing that may be better suited as a FunctionCounter. Not sure if those should be changed in 2.1 though. For example accumulo.tserver.scan.scanned.entries may be a candidate for switching from a gauge to a function counter. Also unrelated to FunctionCounter I noticed there are accumulo.tserver.scan and accumulo.tserver.scans prefixes for metrics.

@DomGarguilo
Copy link
Copy Markdown
Member Author

I noticed there are accumulo.tserver.scan and accumulo.tserver.scans prefixes for metrics.

@keith-turner I can't find the scans in any metrics prefix. Where did you find that?

@dlmarion
Copy link
Copy Markdown
Contributor

dlmarion commented May 7, 2024

What it the purpose of the html javadocs table? From the description it is only for mapping old metrics names to new, but there are entries in there now that were not present in the old metrics system. So, it seems to have also become a place to document new metrics.

The HTML table was the simplest way that I could think of to document the metric name changes as we moved from Hadoop Metrics to Micrometer. I do agree that we need a good way of documenting them, especially since some of them may be used not just for monitoring, but resource control, in future versions.

@dlmarion
Copy link
Copy Markdown
Contributor

dlmarion commented May 7, 2024

I noticed there are accumulo.tserver.scan and accumulo.tserver.scans prefixes for metrics.

@keith-turner I can't find the scans in any metrics prefix. Where did you find that?

These were changed in #4461. From accumulo.tserver.scans to accumulo.scan. I think we are moving towards removing the process name from the metric as it's already in the tags.

@ctubbsii ctubbsii added this to the 2.1.3 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emit metrics for block caches

5 participants