Replies: 4 comments 5 replies
Consider adding a data source integration layer before the “collect metrics data queue” to unify data formats from different systems and services.
Real-time metric data is also stored in the TSDB, so real-time threshold calculations on the collected metric data queue may overlap with periodic threshold calculations on the TSDB.
Nice design and clear architecture layers. I think the most challenging part is how to define our internal data structure.
Could real-time calculation and periodic calculation send duplicate alarm notifications for the same data?
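One possible way to suppress such duplicates, sketched here under assumptions that are not part of the design: both calculation paths attach the same labels to an alarm, and an alarm with an identical label set is not re-sent within a repeat interval. The names `fingerprint`, `AlarmDeduplicator`, and `repeat_interval_ms` are hypothetical.

```python
import hashlib
import json
from typing import Dict


def fingerprint(labels: Dict[str, str]) -> str:
    """Hash a label set so that key order does not affect the result."""
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AlarmDeduplicator:
    """Hypothetical helper: remembers when each label set was last notified."""

    def __init__(self, repeat_interval_ms: int) -> None:
        self.repeat_interval_ms = repeat_interval_ms
        self._last_sent: Dict[str, int] = {}

    def should_notify(self, labels: Dict[str, str], now_ms: int) -> bool:
        key = fingerprint(labels)
        last = self._last_sent.get(key)
        # Suppress the alarm if the same label set was notified too recently.
        if last is not None and now_ms - last < self.repeat_interval_ms:
            return False
        self._last_sent[key] = now_ms
        return True
```

With this sketch, an alarm raised by the real-time path and then again by the periodic path would be notified once, provided both paths produce the same labels.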
Hi, I have an idea for the new alarm design. How about the following?
It is modeled on Prometheus Alertmanager.
The main change is to support PromQL and SQL expressions as periodically evaluated threshold rules, and to restructure the alarm entity object and the threshold rule object.
Alarm JSON, modeled on the Prometheus alert structure:
{
  "labels": {
    "alertname": "HighCPUUsage",
    "priority": "critical",
    "instance": "343483943"
  },
  "annotations": {
    "summary": "High CPU usage detected"
  },
  "content": "CPU usage is above 80% for the last 5 minutes on instance server1.example.com.",
  "status": "firing|resolved",
  "triggerTimes": 1,
  "startAt": "1734005477630",
  "activeAt": "1734005477630",
  "endAt": null
}
Group alarm entity JSON, also modeled on Prometheus:
groupLabels is empty when alarms are not grouped.
{
  "status": "resolved",
  "groupLabels": {
    "alertname": "HighCPUUsage"
  },
  "commonLabels": {
    "alertname": "HighCPUUsage",
    "instance": "server1",
    "severity": "critical"
  },
  "commonAnnotations": {
    "summary": "High CPU usage detected",
    "description": "CPU usage is back to normal for server1"
  },
  "alerts": [
    {
      "status": "resolved",
      "labels": {
        "alertname": "HighCPUUsage",
        "instance": "server1",
        "severity": "critical"
      },
      "annotations": {
        "summary": "High CPU usage detected",
        "description": "CPU usage is back to normal for server1"
      },
      "content": "CPU usage is above 80% for the last 5 minutes on instance server1.example.com.",
      "startAt": "1734005477630",
      "endAt": "1734005477630"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "HighMemoryUsage",
        "instance": "server1",
        "severity": "warning"
      },
      "annotations": {
        "summary": "Memory usage is high",
        "description": "Memory usage exceeds 90% on server1"
      },
      "content": "Memory usage exceeds 90% on server1",
      "startAt": "1734005477630",
      "activeAt": "1734005477630",
      "endAt": "1734005477630"
    }
  ]
}
Welcome to discuss.
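As a rough illustration of how the group entity could be derived from individual alarms, here is a minimal Python sketch. It assumes the Alertmanager webhook convention: commonLabels and commonAnnotations hold only the key/value pairs shared by every alert in the group, and the group status is "firing" if any alert is still firing. Whether this proposal follows the same convention is an assumption, and the names `common_pairs` and `build_group` are hypothetical.

```python
from typing import Any, Dict, List


def common_pairs(dicts: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the key/value pairs that are identical in every dict."""
    if not dicts:
        return {}
    common = dict(dicts[0])
    for d in dicts[1:]:
        common = {k: v for k, v in common.items() if d.get(k) == v}
    return common


def build_group(alerts: List[Dict[str, Any]], group_by: List[str]) -> Dict[str, Any]:
    """Assemble a group alarm entity from alerts already assigned to one group."""
    labels = [a.get("labels", {}) for a in alerts]
    annotations = [a.get("annotations", {}) for a in alerts]
    first = labels[0] if labels else {}
    firing = any(a.get("status") == "firing" for a in alerts)
    return {
        "status": "firing" if firing else "resolved",
        # Empty when group_by is empty, matching "groupLabels is empty when not grouped".
        "groupLabels": {k: first[k] for k in group_by if k in first},
        "commonLabels": common_pairs(labels),
        "commonAnnotations": common_pairs(annotations),
        "alerts": alerts,
    }


# Example: two alerts on the same instance, grouped by "instance".
example_alerts = [
    {"status": "resolved",
     "labels": {"alertname": "HighCPUUsage", "instance": "server1", "severity": "critical"},
     "annotations": {"summary": "High CPU usage detected"}},
    {"status": "firing",
     "labels": {"alertname": "HighMemoryUsage", "instance": "server1", "severity": "warning"},
     "annotations": {"summary": "Memory usage is high"}},
]
group = build_group(example_alerts, group_by=["instance"])
```

Under these assumptions, only `instance` ends up in commonLabels for the two example alerts, because `alertname` and `severity` differ between them.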