Idea about new alarm design #2873

tomsun28 · 2024-12-12T12:38:01Z

tomsun28
Dec 12, 2024
Collaborator

Hi, I have an idea for a new alarm design. What do you think of the proposal below?

It is based on Prometheus Alertmanager.

The main part is to add support for PromQL and SQL periodic threshold expression rules, and to restructure the alarm entity object and the threshold rule object.

Alarm JSON, based on the Prometheus alert structure:

Remove the top-level priority field and put it in labels.

{
  "labels": {
    "alertname": "HighCPUUsage",
    "priority": "critical",
    "instance": "343483943"
  },
  "annotations": {
    "summary": "High CPU usage detected"
  },
  "content": "CPU usage is above 80% for the last 5 minutes on instance server1.example.com.",
  "status": "firing|resolved",
  "triggerTimes": 1,
  "startAt": "1734005477630",
  "activeAt": "1734005477630",
  "endAt": null
}
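
As a rough illustration of the alarm structure above, here is a minimal Python sketch of how such a JSON object could be mapped to a typed entity. The class name `Alarm`, the helper `parse_alarm`, and the snake_case field names are hypothetical and not part of the proposal; the sketch also assumes timestamps are epoch milliseconds carried as strings, as in the example.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Alarm:
    """Single alarm, mirroring the JSON fields in the proposal (names are illustrative)."""
    labels: Dict[str, str] = field(default_factory=dict)
    annotations: Dict[str, str] = field(default_factory=dict)
    content: str = ""
    status: str = "firing"            # "firing" or "resolved"
    trigger_times: int = 1
    start_at: Optional[int] = None    # assumed to be epoch milliseconds
    active_at: Optional[int] = None
    end_at: Optional[int] = None

    @property
    def priority(self) -> Optional[str]:
        # Priority is not a top-level field any more; it is read from the labels.
        return self.labels.get("priority")


def _to_millis(value) -> Optional[int]:
    # The JSON carries timestamps as strings, or null when unset.
    return None if value is None else int(value)


def parse_alarm(raw: str) -> Alarm:
    """Parse one alarm JSON document into an Alarm, rejecting unknown statuses."""
    data = json.loads(raw)
    status = data.get("status", "firing")
    if status not in ("firing", "resolved"):
        raise ValueError(f"unknown status: {status}")
    return Alarm(
        labels=dict(data.get("labels", {})),
        annotations=dict(data.get("annotations", {})),
        content=data.get("content", ""),
        status=status,
        trigger_times=int(data.get("triggerTimes", 1)),
        start_at=_to_millis(data.get("startAt")),
        active_at=_to_millis(data.get("activeAt")),
        end_at=_to_millis(data.get("endAt")),
    )
```

Keeping priority inside `labels` means grouping, routing, and silencing can treat it like any other label instead of needing a special case.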

Group alarm entity JSON, based on Prometheus:

groupLabels is empty when alarms are not grouped.

{
  "status": "resolved",
  "groupLabels": {
    "alertname": "HighCPUUsage"
  },
  "commonLabels": {
    "alertname": "HighCPUUsage",
    "instance": "server1",
    "severity": "critical"
  },
  "commonAnnotations": {
    "summary": "High CPU usage detected",
    "description": "CPU usage is back to normal for server1"
  },
  "alerts": [
    {
      "status": "resolved",
      "labels": {
        "alertname": "HighCPUUsage",
        "instance": "server1",
        "severity": "critical"
      },
      "annotations": {
        "summary": "High CPU usage detected",
        "description": "CPU usage is back to normal for server1"
      },
      "content": "CPU usage is above 80% for the last 5 minutes on instance server1.example.com.", 
      "startAt": "1734005477630",
      "endAt": "1734005477630"
    },
    {
      "status": "firing",
      "labels": {
        "alertname": "HighMemoryUsage",
        "instance": "server1",
        "severity": "warning"
      },
      "annotations": {
        "summary": "Memory usage is high",
        "description": "Memory usage exceeds 90% on server1"
      },
      "content": "Memory usage exceeds 90% on server1",
      "startAt": "1734005477630",
      "activeAt": "1734005477630",
      "endAt": "1734005477630"
    }
  ]
}
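
To make the group structure more concrete, here is a minimal Python sketch of how alerts might be folded into groups. It rests on two assumptions borrowed from Alertmanager's conventions rather than from this proposal: common labels and annotations are the pairs shared by every alert in the group, and a group counts as firing while any member is still firing. The function names `group_alerts` and `_common` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def _common(dicts: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the key/value pairs that appear with the same value in every dict."""
    if not dicts:
        return {}
    common = dict(dicts[0])
    for d in dicts[1:]:
        common = {k: v for k, v in common.items() if d.get(k) == v}
    return common


def group_alerts(alerts: List[dict], group_by: Iterable[str] = ()) -> List[dict]:
    """Bucket alerts by the values of the group_by labels and build one group per bucket."""
    group_by = list(group_by)
    buckets: Dict[Tuple, List[dict]] = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        # A missing label is treated as an empty value (an assumption of this sketch).
        key = tuple((name, labels.get(name, "")) for name in group_by)
        buckets[key].append(alert)

    groups = []
    for key, members in buckets.items():
        firing = any(a.get("status") == "firing" for a in members)
        groups.append({
            "status": "firing" if firing else "resolved",
            "groupLabels": dict(key),  # empty when group_by is empty
            "commonLabels": _common([a.get("labels", {}) for a in members]),
            "commonAnnotations": _common([a.get("annotations", {}) for a in members]),
            "alerts": members,
        })
    return groups
```

Calling `group_alerts(alerts)` without `group_by` puts every alert into a single group with empty `groupLabels`, which matches the note about ungrouped alarms.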

Feedback and discussion are welcome.

zqr10159 · 2024-12-12T13:29:01Z

zqr10159
Dec 12, 2024
Collaborator

Consider adding a data source integration layer before the “collect metrics data queue” to unify data formats from different systems and services.
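
One possible shape for such an integration layer is a set of per-source adapters that all emit the same record type. The Python sketch below is only an illustration: `MetricSample`, `normalize`, and the `push` payload format are hypothetical, while the `prometheus` adapter assumes an instant-vector result item as returned by the Prometheus HTTP API (`metric` plus a `[timestamp, "value"]` pair).

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class MetricSample:
    """Unified record placed on the queue (hypothetical structure)."""
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: int


def from_prometheus(payload: dict) -> MetricSample:
    # Prometheus instant vector item: {"metric": {...}, "value": [unix_seconds, "string value"]}
    labels = dict(payload["metric"])
    name = labels.pop("__name__", "")
    ts_seconds, value = payload["value"]
    return MetricSample(name, labels, float(value), round(float(ts_seconds) * 1000))


def from_push(payload: dict) -> MetricSample:
    # Hypothetical flat push format: {"metric": ..., "tags": {...}, "val": ..., "ts": millis}
    return MetricSample(
        payload["metric"], dict(payload.get("tags", {})), float(payload["val"]), int(payload["ts"])
    )


# Registry of adapters keyed by source type; new sources only add an entry here.
ADAPTERS: Dict[str, Callable[[dict], MetricSample]] = {
    "prometheus": from_prometheus,
    "push": from_push,
}


def normalize(source: str, payload: dict) -> MetricSample:
    """Convert a source-specific payload into the unified MetricSample."""
    try:
        adapter = ADAPTERS[source]
    except KeyError:
        raise ValueError(f"no adapter registered for source: {source}")
    return adapter(payload)
```

With this layout, the threshold calculators downstream of the queue only ever see `MetricSample`, regardless of where the data came from.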

1 reply

tomsun28 Dec 12, 2024
Collaborator Author

+1 Good idea

starryCoder · 2024-12-12T14:04:56Z

starryCoder
Dec 12, 2024

Real-time metric data is stored in TSDB, which may cause overlap between real-time threshold calculations from the collected metric data queue and periodic threshold calculations from TSDB.

2 replies

Calvin979 Dec 12, 2024
Collaborator

Real-time metric data is stored in TSDB, which may cause overlap between real-time threshold calculations from the collected metric data queue and periodic threshold calculations from TSDB.

This can be avoided by setting a strategy for periodic threshold calculations, for example by having them scan only data that is at least 10 minutes old.
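
One reading of this strategy is that the periodic job evaluates a window that ends some fixed lag before the current time, leaving the most recent data to the real-time path. A minimal Python sketch under that assumption follows; the function names, the 10-minute lag, and the 10-minute window length are all illustrative choices, not settled design.

```python
from typing import Iterable, List, Tuple

TEN_MINUTES_MS = 10 * 60 * 1000


def periodic_window(now_ms: int,
                    lag_ms: int = TEN_MINUTES_MS,
                    span_ms: int = TEN_MINUTES_MS) -> Tuple[int, int]:
    """Return the half-open window [start, end) that ends lag_ms before now."""
    end_ms = now_ms - lag_ms
    return end_ms - span_ms, end_ms


def samples_for_periodic_scan(samples: Iterable[Tuple[int, float]],
                              now_ms: int,
                              lag_ms: int = TEN_MINUTES_MS,
                              span_ms: int = TEN_MINUTES_MS) -> List[Tuple[int, float]]:
    """Keep only (timestamp_ms, value) samples that fall inside the lagged window."""
    start_ms, end_ms = periodic_window(now_ms, lag_ms, span_ms)
    return [(ts, value) for ts, value in samples if start_ms <= ts < end_ms]
```

Samples newer than the lag are skipped by the periodic scan, so they are only ever evaluated by the real-time calculation at the moment they arrive.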

tomsun28 Dec 12, 2024
Collaborator Author

Yes, the periodic threshold calculation can run PromQL like this: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[10m]))) > 90. The two approaches complement each other. PromQL and SQL expressions are also mainstream methods, and they suit the calculations we will need for pushed data in the future.
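
For the SQL side, a periodic threshold rule could be expressed as a query in which every returned row counts as a breach. The Python sketch below uses an in-memory SQLite database purely for illustration; the table `cpu_usage`, its columns, and the helper `evaluate_sql_rule` are hypothetical, and a real deployment would query whatever SQL-capable store holds the metrics.

```python
import sqlite3
from typing import List


def evaluate_sql_rule(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> List[dict]:
    """Run a threshold query; each returned row is treated as one breach."""
    cursor = conn.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Illustrative data set: two instances with two samples each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu_usage (instance TEXT, ts INTEGER, usage REAL)")
conn.executemany(
    "INSERT INTO cpu_usage VALUES (?, ?, ?)",
    [
        ("server1", 1, 95.0), ("server1", 2, 97.0),
        ("server2", 1, 40.0), ("server2", 2, 45.0),
    ],
)

# Rule: average usage since a given timestamp exceeds a threshold.
RULE = """
SELECT instance, AVG(usage) AS avg_usage
FROM cpu_usage
WHERE ts >= ?
GROUP BY instance
HAVING AVG(usage) > ?
"""

breaches = evaluate_sql_rule(conn, RULE, (1, 90))
```

Each breach row already carries the columns needed to fill the alarm labels, for example `instance`, so the rule author controls the label set through the `SELECT` list.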

Calvin979 · 2024-12-12T15:20:22Z

Calvin979
Dec 12, 2024
Collaborator

Nice design and clear architecture layers. I think the most challenging part is how to define our internal data structures.

1 reply

tomsun28 Dec 12, 2024
Collaborator Author

Yes, I will implement the alarm structure definition later. The threshold rule structure also needs to be redefined.

Aias00 · 2024-12-13T01:32:11Z

Aias00
Dec 13, 2024
Collaborator

Could real-time calculation and periodic calculation trigger duplicate alarm notifications for the same data?

1 reply

tomsun28 Dec 13, 2024
Collaborator Author

This will happen if both are configured for the same metric. We can also merge the duplicates through alarm grouping or convergence. The periodic threshold is mainly a supplement for mainstream PromQL and SQL expression calculation. The user decides which of the two methods to use.
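
As a sketch of how convergence might merge such duplicates, the Python snippet below treats two alarms as the same when their label sets are identical and folds them into one entry while summing `triggerTimes`. Using the full label set as the identity is an assumption of this example, and `fingerprint` and `converge` are hypothetical names.

```python
import hashlib
from typing import Dict, Iterable, List


def fingerprint(labels: Dict[str, str]) -> str:
    """Stable identity for an alarm, derived from its sorted label pairs."""
    canonical = "\x00".join(f"{k}\x01{v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def converge(alarms: Iterable[dict]) -> List[dict]:
    """Merge alarms that share a fingerprint, accumulating their trigger counts."""
    merged: Dict[str, dict] = {}
    for alarm in alarms:
        key = fingerprint(alarm.get("labels", {}))
        if key in merged:
            merged[key]["triggerTimes"] += alarm.get("triggerTimes", 1)
        else:
            # Copy the first occurrence so the caller's dict is not mutated.
            merged[key] = {**alarm, "triggerTimes": alarm.get("triggerTimes", 1)}
    return list(merged.values())
```

If the real-time and periodic paths both fire for the same metric and produce the same labels, only one alarm reaches the notifier, with `triggerTimes` reflecting both hits.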
