LOG_ENTRY: 2026.03.13

How I Added Slack Alerting to My Kubernetes Homelab


Omobayonle Ogundele

MAIN_NODE: DEVOPS_ENGINEER

Deploying to Kubernetes felt great. Two pods running, zero downtime
deployments, automatic restarts. Everything was working.

But there was one problem — I had no idea when things broke.

If a pod crashed at 3am I wouldn't know until I checked manually or
a friend told me the site was down. That's not how production systems
work. Production systems tell you when something is wrong before your
users notice.

So I added alerting.


The Problem With Not Having Alerts

Before alerting, my workflow for detecting problems was:

  1. Someone tells me the site is down
  2. I SSH into the server
  3. I check kubectl get pods
  4. I figure out what broke
  5. I fix it

That gap between step 1 and step 5 is called Mean Time To Recovery (MTTR)
— one of the most important metrics in DevOps. Every minute your site is down
is a minute users are leaving. Alerting shrinks that gap dramatically.

The goal was simple: get notified on Slack the moment anything goes wrong.


The Stack

The alerting stack I built uses three components that are all part of the
Prometheus ecosystem — which I already had partially set up:

  • Prometheus — already running, scraping metrics every 15 seconds
  • Alertmanager — new addition, receives alerts from Prometheus and routes
    them to the right place
  • Slack — where the notifications land

The flow looks like this:

Prometheus scrapes metrics
        ↓
Evaluates alert rules every 15s
        ↓
Fires alert to Alertmanager
        ↓
Alertmanager sends to Slack #alerts

Step 1 — Creating the Slack Webhook

First I needed a way for Alertmanager to post to Slack. Slack provides
Incoming Webhooks — a URL you POST to and Slack delivers the message
to a channel.

  1. Go to api.slack.com/apps
  2. Create a new app → From Scratch
  3. Name it Homelab Alerts
  4. Go to Incoming Webhooks → Activate
  5. Add a webhook to a channel (I created #alerts)
  6. Copy the webhook URL

The webhook URL looks like:

https://hooks.slack.com/services/xxx/yyy/zzz

That URL is all Alertmanager needs to post messages to Slack.
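Before wiring it into Alertmanager, it's worth smoke-testing the webhook directly. Here's a minimal sketch in Python (the helper names and the placeholder URL are mine, not Slack's; the JSON shape with a "text" field is what Incoming Webhooks expect):

```python
import json
import urllib.request

def build_payload(message: str) -> bytes:
    # Slack Incoming Webhooks accept a JSON body with a "text" field.
    return json.dumps({"text": message}).encode("utf-8")

def post_to_slack(webhook_url: str, message: str) -> int:
    # POST the payload; Slack answers HTTP 200 with body "ok" on success.
    req = urllib.request.Request(
        webhook_url,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (replace the placeholder with your real webhook URL before running):
# post_to_slack("https://hooks.slack.com/services/xxx/yyy/zzz",
#               ":bell: webhook test from the homelab")
```

If the message lands in #alerts, the webhook side is done and any remaining problems are on the Alertmanager side.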


Step 2 — Configuring Alertmanager

I created the Alertmanager config at /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'slack-alerts'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        icon_emoji: ':bell:'
        title: '{{ if eq .Status "firing" }}🔴 ALERT{{ else }}✅ RESOLVED{{ end }}: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Severity:* {{ .Labels.severity }}
          *Description:* {{ .Annotations.description }}
          *Started:* {{ .StartsAt | since }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

A few things worth explaining here:

group_wait: 30s — Alertmanager waits 30 seconds before sending the
first alert. This prevents alert spam if multiple things fire at once.

repeat_interval: 1h — If an alert is still firing, it resends every
hour. You don't get spammed every 15 seconds.

send_resolved: true — When the problem is fixed, you get a green ✅
resolved message. This is important — you need to know when things recover,
not just when they break.

inhibit_rules — If a critical alert fires, it suppresses related
warning alerts. No point getting a warning about high CPU if you're already
getting a critical alert that the pod is down.
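One syntax note: source_match and target_match still work, but newer Alertmanager releases deprecate them in favor of matchers. The same inhibit rule in the newer form would look like this (a sketch for reference, not a change my config needs):

```yaml
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['alertname']
```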


Step 3 — Writing the Alert Rules

Alert rules live in Prometheus and define the conditions that trigger alerts.
I created /etc/prometheus/alert-rules.yml with six alerts covering four
areas: availability, pod health, resource usage, and CI. (The snippets below
show just the rules; in the actual file they sit under a groups: block.)

Site Down

- alert: SiteDown
  expr: up{job="portfolio"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    description: "Portfolio site has been down for more than 1 minute"

up == 0 means Prometheus can't scrape the target. If that's true for
more than 1 minute — alert fires.

Pod Crash Looping

- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    description: "Pod {{ $labels.pod }} is crash looping in namespace {{ $labels.namespace }}"

If a pod is restarting continuously over 15 minutes — something is seriously
wrong. This catches the dreaded CrashLoopBackOff status automatically.

Pod Not Running

- alert: PodNotRunning
  expr: kube_pod_status_phase{phase=~"Pending|Failed|Unknown", namespace="portfolio"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    description: "Pod {{ $labels.pod }} is not running; current phase: {{ $labels.phase }}"

Catches pods stuck in Pending, Failed or Unknown states for more
than 2 minutes.

High CPU Usage

- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "CPU usage is above 80% on {{ $labels.instance }}; current value: {{ $value | printf \"%.1f\" }}%"

If CPU stays above 80% for 5 minutes something is consuming resources
abnormally. Warning severity — not critical but worth knowing about.

High Memory Usage

- alert: HighMemoryUsage
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "Memory usage is above 85% on {{ $labels.instance }}; current value: {{ $value | printf \"%.1f\" }}%"

Same pattern as CPU — sustained high memory usage for 5 minutes triggers
a warning.
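Both expressions reduce to the same arithmetic. As a plain-Python illustration of what the PromQL computes (the function names are mine, for clarity only):

```python
def cpu_usage_percent(idle_rate: float) -> float:
    # Mirrors: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
    # idle_rate is the fraction of time (0.0 to 1.0) the CPU spent idle.
    return 100 - idle_rate * 100

def memory_usage_percent(available_bytes: int, total_bytes: int) -> float:
    # Mirrors: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
    return (1 - available_bytes / total_bytes) * 100

# A node idle 15% of the time is at roughly 85% CPU usage -> over the 80% threshold.
# 2 GiB available out of 16 GiB total is 87.5% used -> over the 85% threshold.
```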

Pipeline Failure

- alert: PipelineFailure
  expr: increase(drone_build_count{status="failure"}[30m]) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    description: "Drone CI pipeline failed in the last 30 minutes"

If any build fails in the last 30 minutes — get notified immediately.
No more wondering why the site didn't update after a push.


Step 4 — Adding Alertmanager to Docker Compose

I added Alertmanager as a new service in my monitoring docker-compose.yml:

version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - /etc/prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
    ports:
      - "9090:9090"
    restart: always
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - /etc/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: always
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: always

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: always

And updated prometheus.yml to tell Prometheus where Alertmanager is and
where the alert rules live:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/alert-rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'portfolio'
    static_configs:
      - targets: ['129.146.31.124:31889']
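Because the Prometheus container runs with --web.enable-lifecycle (set in the compose file above), config and rule changes can be picked up without a restart by POSTing to its reload endpoint. A small sketch (the helper name is mine; the /-/reload endpoint is Prometheus's own):

```python
import urllib.request

def reload_prometheus(base_url: str = "http://localhost:9090") -> int:
    # POST to Prometheus's lifecycle reload endpoint; HTTP 200 means the
    # new prometheus.yml and rule files were loaded successfully.
    req = urllib.request.Request(f"{base_url}/-/reload", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example: call reload_prometheus() after editing alert-rules.yml
```

The equivalent one-liner is a plain curl -X POST against the same URL; either way beats restarting the container and losing in-memory state.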

Step 5 — Testing It

Before trusting a system like this in production I needed to verify it
actually worked. I fired a test alert manually using the Alertmanager API:

curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical"
    },
    "annotations": {
      "description": "This is a test alert from your homelab — alerting is working!"
    }
  }]'

Within 30 seconds a message appeared in #alerts on Slack:

🔴 ALERT: TestAlert
Severity: critical
Description: This is a test alert from your homelab — alerting is working!

Alerting confirmed working.
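To double-check from the terminal rather than Slack, you can also list the alerts Alertmanager currently holds via the same v2 API (the helper name is mine; the endpoint is Alertmanager's own):

```python
import json
import urllib.request

def active_alerts(base_url: str = "http://localhost:9093") -> list:
    # GET /api/v2/alerts returns a JSON array of the alerts Alertmanager
    # currently knows about, including any test alerts fired by hand.
    with urllib.request.urlopen(f"{base_url}/api/v2/alerts") as resp:
        return json.load(resp)

# Example: [a["labels"]["alertname"] for a in active_alerts()]
```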


What I Now Get Notified About

Alert               Condition                        Severity   Response Time
Site Down           Prometheus can't reach the app   Critical   1 minute
Pod Crash Looping   Pod restarting repeatedly        Critical   5 minutes
Pod Not Running     Pod stuck in non-running state   Critical   2 minutes
High CPU            CPU above 80% sustained          Warning    5 minutes
High Memory         Memory above 85% sustained       Warning    5 minutes
Pipeline Failure    Drone CI build failed            Warning    1 minute

What This Changed

Before alerting, my infrastructure was a black box. Things could break and
I wouldn't know until someone told me or I happened to check.

After alerting, the system tells me what's wrong before users notice. That
shift — from reactive to proactive — is one of the most important mindset
changes in DevOps.

The difference between a junior and senior DevOps engineer isn't just
knowing how to deploy things. It's knowing how to observe them, measure
them and get woken up when they break.


My Full Observability Stack

Metrics Collection  →  Prometheus (scrapes every 15s)
Visualization       →  Grafana (dashboards)
Alert Evaluation    →  Prometheus (checks rules every 15s)
Alert Routing       →  Alertmanager (deduplication, grouping)
Notification        →  Slack #alerts

What's Next

  • PagerDuty integration for on-call rotation
  • cert-manager for automatic SSL on Kubernetes
  • ArgoCD for GitOps deployments
  • Horizontal Pod Autoscaler to scale pods based on CPU

One thing at a time. 🚀


Everything documented here is running live on my homelab right now.
Follow the journey on Twitter or
connect on LinkedIn.


Omobayonle Ogundele

DevOps Engineer based in Lagos, Nigeria. Building reliable infrastructure and sharing logs from the edge of production.
