Measurement volume heuristic for faulty measurements detection #145
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #145 +/- ##
==========================================
+ Coverage 82.77% 83.23% +0.46%
==========================================
Files 78 81 +3
Lines 4871 4981 +110
==========================================
+ Hits 4032 4146 +114
+ Misses 839 835 -4
dags/pipeline.py
Outdated
op_make_observations_hourly
>> op_make_analysis_hourly
>> op_make_event_detector_hourly
>> op_make_volume_analysis_hourly
I don't think this needs to depend on the previous steps. It can run on its own, separately from them, and even concurrently.
query = """
SELECT
probe_cc, probe_asn, engine_version,
software_version, platform, architecture,
I would suggest adding the software_name key here as well.
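To illustrate why the key matters, here is a minimal sketch with synthetic rows. The client names `client-a` and `client-b` and the helper `count_by` are made up for illustration and are not part of this PR. Two clients that happen to share a version string are merged into one group unless `software_name` is part of the grouping key, which can push the merged count over a volume threshold that neither client reaches alone.

```python
from collections import Counter

# Synthetic rows: two hypothetical clients that share a version string.
rows = (
    [{"software_name": "client-a", "software_version": "1.0.0"}] * 120
    + [{"software_name": "client-b", "software_version": "1.0.0"}] * 120
)


def count_by(rows, keys):
    """Count rows per tuple of the given key columns (hypothetical helper)."""
    return Counter(tuple(r[k] for k in keys) for r in rows)


# Without software_name both clients collapse into a single group.
without_name = count_by(rows, ["software_version"])
# With software_name each client is counted on its own.
with_name = count_by(rows, ["software_name", "software_version"])

print(without_name)
print(with_name)
```

With a threshold such as 200, the merged group would be flagged while neither individual client would be.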
I ran an analysis with the threshold you suggested above on the data from the first 10 days of January and I get 541 anomalies. This is the query I ran:

SELECT
    probe_cc, probe_asn, engine_version, software_name,
    software_version, platform, architecture,
    toStartOfMinute(measurement_start_time) as minute_start,
    test_name,
    count() as total
FROM fastpath
WHERE
    measurement_start_time >= '2026-01-01' AND
    measurement_start_time < '2026-01-10'
GROUP BY probe_cc, probe_asn, engine_version, test_name, software_name, software_version, platform, architecture, minute_start
HAVING total >= 200

I then counted the occurrences of anomalies with:

df_volume_anomaly[
    ['probe_cc', 'probe_asn', 'platform', 'software_version', 'software_name', 'test_name', 'total']
].groupby(['probe_cc', 'probe_asn', 'platform', 'software_version', 'test_name', 'software_name']).count().reset_index().to_markdown()

and get:
The mean of the thresholds is:
This seems like a reasonable starting point: it captures a fair number of anomalies without being too noisy.
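For readers who want to experiment with the threshold offline, the per-minute counting can be sketched in plain Python. This is only an approximation of the query in this thread, not the code in this PR: the function names `find_volume_anomalies` and `anomalies_per_group` are hypothetical, the input is assumed to be a list of dicts with the same column names as the query, and the sample rows are synthetic.

```python
from collections import Counter
from datetime import datetime

# Grouping columns taken from the query in this thread.
GROUP_KEYS = (
    "probe_cc", "probe_asn", "engine_version", "software_name",
    "software_version", "platform", "architecture", "test_name",
)
THRESHOLD = 200  # same value as the HAVING clause; treat it as tunable


def find_volume_anomalies(measurements, threshold=THRESHOLD):
    """Return {(group_key, minute_start): total} for buckets at or above threshold."""
    counts = Counter()
    for m in measurements:
        # Rough equivalent of toStartOfMinute(): drop seconds and microseconds.
        minute_start = m["measurement_start_time"].replace(second=0, microsecond=0)
        key = tuple(m[k] for k in GROUP_KEYS)
        counts[(key, minute_start)] += 1
    return {bucket: total for bucket, total in counts.items() if total >= threshold}


def anomalies_per_group(anomalies):
    """Count anomalous minutes per full grouping key."""
    per_group = Counter()
    for key, _minute in anomalies:
        per_group[key] += 1
    return per_group


# Synthetic data: one probe sends 250 measurements within a single minute,
# another sends only 10.
base = {
    "probe_cc": "XX", "probe_asn": 64500, "engine_version": "3.0.0",
    "software_name": "example-app", "software_version": "1.0.0",
    "platform": "linux", "architecture": "amd64", "test_name": "web_connectivity",
}
noisy = [
    dict(base, measurement_start_time=datetime(2026, 1, 1, 12, 0, i % 60))
    for i in range(250)
]
quiet = [
    dict(base, probe_asn=64501, measurement_start_time=datetime(2026, 1, 1, 12, 0, 5))
    for _ in range(10)
]

anomalies = find_volume_anomalies(noisy + quiet)
per_group = anomalies_per_group(anomalies)
print(len(anomalies), sum(per_group.values()))
```

Only the 250-measurement bucket is flagged here. Note that the pandas step in the comment above groups by a subset of these columns, so its per-group counts can differ from `anomalies_per_group`, which uses the full key.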
This PR implements the volume analysis heuristic for detecting faulty measurements.
Closes #144