Rules

Blackbox exporter alerts

Last evaluation: 25.483s ago; evaluation time: 1.515ms

Rule State Error Last Evaluation Evaluation Time
alert: ProbeFailure expr: probe_success{job="blackbox"} == 0 for: 10m labels: severity: critical annotations: description: The {{ $labels.module }} probe to {{ $labels.instance }} has failed due to protocol errors or failed checks. generic_summary: Blackbox probe failed summary: The probe to {{ $labels.instance }} has failed ok 25.484s ago 427.6us
alert: HTTPSNotUsed expr: probe_http_ssl{job="blackbox",module=~"https(_ipv6)?"} == 0 for: 10m labels: severity: warning annotations: description: The HTTP server at {{ $labels.instance }} did not redirect to HTTPS, or SSL failed. generic_summary: HTTPS not used summary: The HTTP server at {{ $labels.instance }} did not force SSL ok 25.484s ago 397.3us
alert: SSLCertExpiringSoon expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 15 for: 10m labels: severity: warning annotations: description: The SSL certificate at {{ $labels.instance }} will expire in {{ humanizeDuration $value }}. generic_summary: SSL certificate expiring soon summary: The SSL certificate at {{ $labels.instance }} will expire soon ok 25.484s ago 285.3us
alert: SSLCertExpiringSoon expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 7 for: 10m labels: severity: critical annotations: description: The SSL certificate at {{ $labels.instance }} will expire in {{ humanizeDuration $value }}. generic_summary: SSL certificate expiring VERY soon summary: The SSL certificate at {{ $labels.instance }} will expire VERY soon ok 25.483s ago 190.3us
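
The two SSLCertExpiringSoon rules compare the seconds remaining until the earliest certificate expiry against a 15-day and a 7-day threshold. The arithmetic can be sketched in Python as follows, assuming the expiry is available as a Unix timestamp. The function name `cert_expiry_severity` is hypothetical and not part of Prometheus or the blackbox exporter, and the `for: 10m` hold period is not modelled.

```python
from typing import Optional

SECONDS_PER_DAY = 86400


def cert_expiry_severity(expiry_ts: float, now_ts: float) -> Optional[str]:
    """Return the most severe matching severity, or None if neither rule matches.

    Mirrors `probe_ssl_earliest_cert_expiry - time() < 86400 * N`.
    """
    remaining = expiry_ts - now_ts  # seconds until the certificate expires
    if remaining < SECONDS_PER_DAY * 7:
        return "critical"
    if remaining < SECONDS_PER_DAY * 15:
        return "warning"
    return None
```

When fewer than 7 days remain, both rule expressions match in Prometheus; the sketch reports only the more severe of the two.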

Cronjob alerts

Last evaluation: 4.273s ago; evaluation time: 746.8us

Rule State Error Last Evaluation Evaluation Time
alert: FailedCronJob expr: batch_last_finish_seconds > batch_last_success_seconds for: 5m labels: severity: warning annotations: description: The last run of cronjob {{ $labels.job }} in {{ $labels.instance }} has failed. generic_summary: Cronjob failed summary: Cronjob {{ $labels.job }} in {{ $labels.instance }} has failed ok 4.273s ago 331.6us
alert: SlowCronJob expr: batch_running_time_seconds > 7200 for: 5m labels: severity: info annotations: description: The last run of cronjob {{ $labels.job }} in {{ $labels.instance }} has taken more than 2 hours. generic_summary: Cronjob too slow summary: Cronjob {{ $labels.job }} in {{ $labels.instance }} is too slow ok 4.273s ago 125.1us
alert: StuckCronJob expr: batch_running_time_seconds > 14400 for: 5m labels: severity: warning annotations: description: The last run of cronjob {{ $labels.job }} in {{ $labels.instance }} has taken more than 4 hours, and it is considered stuck/hung. generic_summary: Cronjob stuck summary: Cronjob {{ $labels.job }} in {{ $labels.instance }} is stuck ok 4.273s ago 83.74us
alert: MissingCronJob expr: time() - batch_last_start_seconds > batch_period_seconds for: 5m labels: severity: warning annotations: description: The cronjob {{ $labels.job }} in {{ $labels.instance }} has not run in the expected period. generic_summary: Cronjob missing summary: Cronjob {{ $labels.job }} in {{ $labels.instance }} has not run ok 4.273s ago 189.4us
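
To make the four cronjob conditions easier to compare, here is a minimal Python sketch that evaluates them for a single job. It assumes the `batch_*` values are plain Unix timestamps or durations in seconds. The function name `cronjob_alerts` and its parameters are hypothetical, and the `for: 5m` hold period is ignored.

```python
from typing import List


def cronjob_alerts(last_start: float, last_finish: float, last_success: float,
                   period: float, now: float) -> List[str]:
    """Names of the cronjob alerts whose expression would match for one job."""
    # Mirrors the batch_running_time_seconds recording rule: use finish - start
    # when it is positive, otherwise the time elapsed since the job started.
    running_time = last_finish - last_start
    if running_time <= 0:
        running_time = now - last_start

    alerts = []
    if last_finish > last_success:
        alerts.append("FailedCronJob")
    if running_time > 7200:       # more than 2 hours
        alerts.append("SlowCronJob")
    if running_time > 14400:      # more than 4 hours
        alerts.append("StuckCronJob")
    if now - last_start > period:
        alerts.append("MissingCronJob")
    return alerts
```

A job that started 10000 seconds ago and has not yet finished its current run would match only SlowCronJob, because its running time falls between the 2-hour and 4-hour thresholds.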

General alerts

Last evaluation: 20.418s ago; evaluation time: 915.5us

Rule State Error Last Evaluation Evaluation Time
alert: InstanceDown expr: up == 0 or pg_up == 0 for: 5m labels: severity: critical annotations: description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.' generic_summary: Service down summary: Instance {{ $labels.instance }} down ok 20.418s ago 904.4us

Node alerts

Last evaluation: 9.533s ago; evaluation time: 22.04ms

Rule State Error Last Evaluation Evaluation Time
alert: HighCpuUsage expr: (1 - instance:node_cpu_seconds_total:avg_rate5m{job="node",mode="idle"}) * 100 > 90 for: 5m labels: severity: info annotations: description: The CPU usage in {{ $labels.instance }} has been over 90% for more than 5 minutes. generic_summary: CPU usage too high summary: CPU usage in {{ $labels.instance }} is too high ok 9.533s ago 420.5us
alert: HighLoadAvg expr: node_load15{job="node"} > 100 for: 5m labels: severity: info annotations: description: The 15-minute load average in {{ $labels.instance }} has been over 100 for more than 5 minutes. generic_summary: Load average too high summary: The load average in {{ $labels.instance }} is too high ok 9.533s ago 185.4us
alert: MemFull expr: instance:node_memory_MemUsed_bytes_per_node_memory_MemTotal_bytes:ratio{job="node"} * 100 > 90 for: 15m labels: severity: info annotations: description: The memory usage in {{ $labels.instance }} has been over 90% for more than 15 minutes. generic_summary: Memory usage too high summary: Memory usage in {{ $labels.instance }} is too high ok 9.533s ago 164.1us
alert: MemFull expr: instance:node_memory_MemUsed_bytes_per_node_memory_MemTotal_bytes:ratio{job="node"} * 100 > 95 for: 5m labels: severity: warning annotations: description: The memory usage in {{ $labels.instance }} has been over 95% for more than 5 minutes. generic_summary: Memory usage critical summary: Memory usage in {{ $labels.instance }} is critical ok 9.534s ago 152.8us
alert: FSFull expr: instance:node_filesystem_avail_bytes_per_node_filesystem_size_bytes:ratio{job="node"} * 100 <= 1 for: 5m labels: severity: warning annotations: description: The {{ $labels.mountpoint }} filesystem in {{ $labels.instance }} has 1% or less available space. generic_summary: Filesystem almost full summary: Filesystem {{ $labels.mountpoint }} in {{ $labels.instance }} is almost full ok 9.534s ago 421.1us
alert: FSFull expr: instance:node_filesystem_avail_bytes_per_node_filesystem_size_bytes:ratio{job="node"} * 100 <= 0.5 for: 5m labels: severity: critical annotations: description: The {{ $labels.mountpoint }} filesystem in {{ $labels.instance }} is full. generic_summary: Filesystem full summary: Filesystem {{ $labels.mountpoint }} in {{ $labels.instance }} is full ok 9.534s ago 394.5us
alert: FSFullSoon expr: predict_linear(instance:node_filesystem_avail_bytes:sum{job="node"}[12h], 24 * 3600) <= 0 for: 5m labels: severity: info annotations: description: The {{ $labels.mountpoint }} filesystem in {{ $labels.instance }} will be full in 24 hours at the current rate. generic_summary: Filesystem full soon summary: Filesystem {{ $labels.mountpoint }} in {{ $labels.instance }} will fill soon ok 9.533s ago 10.67ms
alert: FSFullSoon expr: predict_linear(instance:node_filesystem_avail_bytes:sum{job="node"}[4h], 4 * 3600) <= 0 for: 30m labels: severity: warning annotations: description: The {{ $labels.mountpoint }} filesystem in {{ $labels.instance }} will be full in 4 hours at the current rate. generic_summary: Filesystem full VERY soon summary: Filesystem {{ $labels.mountpoint }} in {{ $labels.instance }} will fill VERY soon ok 9.523s ago 4.035ms
alert: MemFullSoon expr: predict_linear(instance:node_memory_MemUsed_bytes_per_node_memory_MemTotal_bytes:ratio{job="node"}[12h], 24 * 3600) * 100 > 99 for: 5m labels: severity: info annotations: description: The memory usage in {{ $labels.instance }} will reach 100% in 24 hours at the current rate. generic_summary: Memory full soon summary: Memory in {{ $labels.instance }} will fill in 24h ok 9.519s ago 2.998ms
alert: MemFullSoon expr: predict_linear(instance:node_memory_MemUsed_bytes_per_node_memory_MemTotal_bytes:ratio{job="node"}[8h], 4 * 3600) * 100 > 99 for: 30m labels: severity: warning annotations: description: The memory usage in {{ $labels.instance }} will reach 100% in 4 hours at the current rate. generic_summary: Memory full VERY soon summary: Memory in {{ $labels.instance }} will fill in 4h ok 9.516s ago 2.057ms
alert: ProcessNearFDLimits expr: process_open_fds / process_max_fds * 100 > 80 for: 5m labels: severity: warning annotations: description: The process for {{ $labels.job }} in {{ $labels.instance }} has {{ $value }}% of available file descriptors in use. generic_summary: Too many files open summary: The process in {{ $labels.instance }} has too many files open. ok 9.514s ago 514.2us
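
The FSFullSoon and MemFullSoon rules rely on `predict_linear`, which fits a least-squares line through the samples in the range and extrapolates it a given number of seconds past the evaluation time. The Python sketch below illustrates the idea in simplified form. It is not Prometheus's implementation, and the sample data in the usage note is made up for illustration.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]],
                   now: float, horizon: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples
    and evaluate it at now + horizon (both in seconds)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + horizon)
```

For example, with available space falling by 1000 bytes per hour from 12000 bytes, `predict_linear([(0, 12000), (3600, 11000), (7200, 10000)], now=7200, horizon=24 * 3600)` is negative, so the `<= 0` condition of the 24-hour FSFullSoon rule would match.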

Cronjob rules

Last evaluation: 21.033s ago; evaluation time: 918.2us

Rule State Error Last Evaluation Evaluation Time
record: batch_running_time_seconds expr: (((batch_last_finish_seconds - batch_last_start_seconds) > 0) or (time() - batch_last_start_seconds)) ok 21.033s ago 388.9us
record: batch_period_seconds expr: (batch_last_start_seconds{job=~"monthly.*"} * 0 + 3600 * 24 * 31) or (batch_last_start_seconds{job=~"weekly.*"} * 0 + 3600 * 24 * 7) or (batch_last_start_seconds{job=~"daily.*"} * 0 + 3600 * 24) or (batch_last_start_seconds * 0 + 3600 * 24 * 7) ok 21.033s ago 517us
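
The `batch_period_seconds` rule uses two idioms: `metric * 0 + N` keeps each series' labels while replacing its value with the constant `N`, and the `or` chain keeps the first branch that produces a series for a given label set. The selection logic might be sketched in Python as below. The function and constant names are hypothetical, and `re.fullmatch` is used because PromQL regex matchers are fully anchored.

```python
import re

# (anchored regex on the job label, period in seconds),
# listed in the same order as the `or` chain in the rule.
PERIODS = [
    (r"monthly.*", 3600 * 24 * 31),
    (r"weekly.*", 3600 * 24 * 7),
    (r"daily.*", 3600 * 24),
]
DEFAULT_PERIOD = 3600 * 24 * 7  # final fallback branch: one week


def batch_period_seconds(job: str) -> int:
    """Expected period for a job, chosen by the prefix of its name."""
    for pattern, period in PERIODS:
        if re.fullmatch(pattern, job):
            return period
    return DEFAULT_PERIOD
```

Under these assumptions a job named `daily-backup` gets a period of 86400 seconds, while a job whose name matches none of the prefixes falls back to one week.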

Basic node rules

Last evaluation: 3.564s ago; evaluation time: 27.36ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_memory_MemTotal_bytes:sum expr: node_memory_MemTotal_bytes{job="node"} ok 3.564s ago 367.4us
record: instance:node_memory_MemFree_bytes:sum expr: node_memory_MemFree_bytes{job="node"} ok 3.564s ago 180.5us
record: instance:node_memory_MemUsed_bytes:sum expr: node_memory_MemTotal_bytes{job="node"} - node_memory_MemFree_bytes{job="node"} - node_memory_Buffers_bytes{job="node"} - node_memory_Cached_bytes{job="node"} ok 3.564s ago 905us
record: instance:node_memory_MemUsed_bytes_per_node_memory_MemTotal_bytes:ratio expr: instance:node_memory_MemUsed_bytes:sum{job="node"} / instance:node_memory_MemTotal_bytes:sum{job="node"} ok 3.563s ago 369.9us
record: instance:node_cpu_seconds_total:rate5m expr: rate(node_cpu_seconds_total{job="node"}[5m]) ok 3.563s ago 6.862ms
record: instance:node_cpu_seconds_total:avg_rate5m expr: avg without (cpu) (rate(node_cpu_seconds_total{job="node"}[5m])) ok 3.556s ago 5.359ms
record: instance:node_network_receive_bytes_total:rate5m expr: rate(node_network_receive_bytes_total{job="node"}[5m]) ok 3.551s ago 711.5us
record: instance:node_network_receive_drop_total:rate5m expr: rate(node_network_receive_drop_total{job="node"}[5m]) ok 3.55s ago 621.5us
record: instance:node_network_receive_errs_total:rate5m expr: rate(node_network_receive_errs_total{job="node"}[5m]) ok 3.55s ago 711.5us
record: instance:node_network_receive_packets_total:rate5m expr: rate(node_network_receive_packets_total{job="node"}[5m]) ok 3.549s ago 760.8us
record: instance:node_network_transmit_bytes_total:rate5m expr: rate(node_network_transmit_bytes_total{job="node"}[5m]) ok 3.548s ago 688.1us
record: instance:node_network_transmit_drop_total:rate5m expr: rate(node_network_transmit_drop_total{job="node"}[5m]) ok 3.548s ago 700.4us
record: instance:node_network_transmit_errs_total:rate5m expr: rate(node_network_transmit_errs_total{job="node"}[5m]) ok 3.547s ago 601.3us
record: instance:node_network_transmit_packets_total:rate5m expr: rate(node_network_transmit_packets_total{job="node"}[5m]) ok 3.546s ago 692.2us
record: instance:node_disk_io_time_seconds_total:rate5m expr: rate(node_disk_io_time_seconds_total{job="node"}[5m]) ok 3.546s ago 900.9us
record: instance:node_disk_read_bytes_total:rate5m expr: rate(node_disk_read_bytes_total{job="node"}[5m]) ok 3.545s ago 819.1us
record: instance:node_disk_written_bytes_total:rate5m expr: rate(node_disk_written_bytes_total{job="node"}[5m]) ok 3.544s ago 851.1us
record: instance:node_filesystem_avail_bytes:sum expr: node_filesystem_avail_bytes{job="node"} ok 3.543s ago 488.2us
record: instance:node_filesystem_free_bytes:sum expr: node_filesystem_free_bytes{job="node"} ok 3.543s ago 476.6us
record: instance:node_filesystem_size_bytes:sum expr: node_filesystem_size_bytes{job="node"} ok 3.543s ago 468.1us
record: instance:node_filesystem_avail_bytes_per_node_filesystem_size_bytes:ratio expr: node_filesystem_avail_bytes{job="node"} / node_filesystem_size_bytes{job="node"} ok 3.542s ago 917.7us
record: instance:node_filesystem_free_bytes_per_node_filesystem_size_bytes:ratio expr: node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} ok 3.541s ago 1.01ms
record: instance:node_filesystem_files:sum expr: node_filesystem_files{job="node"} ok 3.54s ago 492.6us
record: instance:node_filesystem_files_free:sum expr: node_filesystem_files_free{job="node"} ok 3.54s ago 465.4us
record: instance:node_filesystem_files_free_per_node_filesystem_files:ratio expr: node_filesystem_files_free{job="node"} / node_filesystem_files{job="node"} ok 3.54s ago 874.6us
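
As a worked example of how the memory recording rules feed the MemFull alerts, the Python sketch below computes the used-memory ratio the same way (total minus free, buffers and cached, divided by total) and checks it against the 90% and 95% thresholds. The function names are hypothetical, and the `for:` hold periods are not modelled.

```python
from typing import List


def mem_used_ratio(mem_total: float, mem_free: float,
                   buffers: float, cached: float) -> float:
    """Used memory as a fraction of total, excluding buffers and page cache."""
    used = mem_total - mem_free - buffers - cached
    return used / mem_total


def mem_full_severities(ratio: float) -> List[str]:
    """Severities of the MemFull rules whose threshold the ratio exceeds."""
    severities = []
    if ratio * 100 > 90:
        severities.append("info")
    if ratio * 100 > 95:
        severities.append("warning")
    return severities
```

For instance, a host with 16e9 bytes total, 1e9 free, 0.5e9 in buffers and 2.5e9 cached has a ratio of 0.75, which is below both thresholds.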