How I Added Slack Alerting to My Kubernetes Homelab
Omobayonle Ogundele
Deploying to Kubernetes felt great. Two pods running, zero downtime
deployments, automatic restarts. Everything was working.
But there was one problem — I had no idea when things broke.
If a pod crashed at 3am I wouldn't know until I checked manually or
a friend told me the site was down. That's not how production systems
work. Production systems tell you when something is wrong before your
users notice.
So I added alerting.
The Problem With Not Having Alerts
Before alerting, my workflow for detecting problems was:
1. Someone tells me the site is down
2. I SSH into the server
3. I check kubectl get pods
4. I figure out what broke
5. I fix it
That gap between step 1 and step 5 is called Mean Time To Recovery (MTTR)
— one of the most important metrics in DevOps. Every minute your site is down
is a minute users are leaving. Alerting shrinks that gap dramatically.
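To make MTTR concrete, here's a tiny sketch of how the metric is computed, using made-up incident timestamps (not real outages from my homelab):

```python
from datetime import datetime

# Hypothetical incident log: (detected, recovered) pairs -- illustrative only
incidents = [
    (datetime(2024, 1, 3, 3, 0), datetime(2024, 1, 3, 3, 45)),   # 45 min outage
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 15)), # 15 min outage
]

# MTTR = average time from detection to recovery
downtimes_min = [(end - start).total_seconds() / 60 for start, end in incidents]
mttr = sum(downtimes_min) / len(downtimes_min)
print(f"MTTR: {mttr:.0f} minutes")  # -> MTTR: 30 minutes
```

Alerting attacks the "detected" side of each pair: the sooner you know, the sooner the clock stops.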
The goal was simple: get notified on Slack the moment anything goes wrong.
The Stack
The alerting stack I built uses three components that are all part of the
Prometheus ecosystem — which I already had partially set up:
- Prometheus — already running, scraping metrics every 15 seconds
- Alertmanager — new addition, receives alerts from Prometheus and routes them to the right place
- Slack — where the notifications land
The flow looks like this:
Prometheus scrapes metrics
↓
Evaluates alert rules every 15s
↓
Fires alert to Alertmanager
↓
Alertmanager sends to Slack #alerts
Step 1 — Creating the Slack Webhook
First I needed a way for Alertmanager to post to Slack. Slack provides
Incoming Webhooks — a URL you POST to and Slack delivers the message
to a channel.
- Go to api.slack.com/apps
- Create a new app → From Scratch
- Name it Homelab Alerts
- Go to Incoming Webhooks → Activate
- Add a webhook to a channel (I created #alerts)
- Copy the webhook URL
The webhook URL looks like:
https://hooks.slack.com/services/xxx/yyy/zzz
That URL is all Alertmanager needs to post messages to Slack.
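Before involving Alertmanager, it's worth sanity-checking the webhook by POSTing a message yourself. A minimal sketch — the URL below is the same placeholder as above, so swap in your real one:

```python
import json
import urllib.request

# Placeholder URL -- substitute the webhook Slack generated for you
WEBHOOK_URL = "https://hooks.slack.com/services/xxx/yyy/zzz"

payload = {"text": ":bell: Webhook test from the homelab"}
req = urllib.request.Request(
    WEBHOOK_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        # Slack replies with the literal body "ok" on success
        print(resp.read().decode())
except Exception as exc:
    # The placeholder URL will be rejected, so expect this branch until you swap it
    print(f"request failed: {exc}")
```

If you see the message land in #alerts, the webhook side is done and everything else is plumbing.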
Step 2 — Configuring Alertmanager
I created the Alertmanager config at /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'slack-alerts'

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        icon_emoji: ':bell:'
        title: '{{ if eq .Status "firing" }}🔴 ALERT{{ else }}✅ RESOLVED{{ end }}: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Severity:* {{ .Labels.severity }}
          *Description:* {{ .Annotations.description }}
          *Started:* {{ .StartsAt | since }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']
A few things worth explaining here:
group_wait: 30s — Alertmanager waits 30 seconds before sending the
first notification, so several alerts firing at once are batched into one
message instead of spamming the channel.
repeat_interval: 1h — if an alert is still firing, the notification is
re-sent every hour rather than on every 15-second evaluation.
send_resolved: true — when the problem is fixed you get a green ✅
resolved message. This is important — you need to know when things recover,
not just when they break.
inhibit_rules — while a critical alert is firing, warning alerts that
share the same alertname are suppressed. No point getting a warning layered
on top of a critical for the same problem.
Step 3 — Writing the Alert Rules
Alert rules live in Prometheus and define the conditions that trigger alerts.
I created /etc/prometheus/alert-rules.yml with four categories of alerts —
availability, pod health, resource usage, and CI — six rules in all. (In the
actual file each rule is nested under a groups: → rules: section; only the
rules themselves are shown below.)
Site Down
- alert: SiteDown
  expr: up{job="portfolio"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    description: "Portfolio site has been down for more than 1 minute"
up == 0 means Prometheus can't scrape the target. If that's true for
more than 1 minute — alert fires.
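The `for: 1m` clause means the condition has to hold on every evaluation for a full minute before the alert fires. A simplified sketch of that behaviour, assuming the 15-second evaluation interval from earlier:

```python
def fires(condition_history, for_evals):
    """True when the alert condition was true on each of the last
    `for_evals` evaluations -- a simplified model of PromQL's `for:`.
    With a 15s evaluation interval, `for: 1m` means for_evals=4."""
    if len(condition_history) < for_evals:
        return False
    return all(condition_history[-for_evals:])

# up == 0 held for the last four evaluations (one minute) -> alert fires
print(fires([False, True, True, True, True], for_evals=4))  # -> True
# one successful scrape resets the clock -> no alert yet
print(fires([True, True, False, True], for_evals=4))        # -> False
```

This is why a single flaky scrape doesn't page you: the condition has to be continuously true for the whole `for:` window.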
Pod Crash Looping
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    description: "Pod {{ $labels.pod }} is crash looping in namespace {{ $labels.namespace }}"
If a pod is restarting continuously over 15 minutes — something is seriously
wrong. This catches the dreaded CrashLoopBackOff status automatically.
Pod Not Running
- alert: PodNotRunning
  expr: kube_pod_status_phase{phase!="Running", namespace="portfolio"} == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    description: "Pod {{ $labels.pod }} is not running — current phase: {{ $labels.phase }}"
Catches pods stuck in Pending, Failed or Unknown states for more
than 2 minutes.
High CPU Usage
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "CPU usage is above 80% on {{ $labels.instance }} — current value: {{ $value | printf \"%.1f\" }}%"
If CPU stays above 80% for 5 minutes something is consuming resources
abnormally. Warning severity — not critical but worth knowing about.
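To unpack that expression: `rate(...{mode="idle"}[5m])` gives idle CPU-seconds accumulated per second, multiplying by 100 gives idle percent, and subtracting from 100 gives busy percent. A worked sketch with made-up counter samples (single CPU, fixed 15s scrape interval, no averaging across instances):

```python
def cpu_busy_percent(idle_counter_samples, interval_s=15):
    """Approximate the PromQL expression above for one CPU:
    100 - rate(idle-seconds counter) * 100."""
    window_s = (len(idle_counter_samples) - 1) * interval_s
    idle_delta = idle_counter_samples[-1] - idle_counter_samples[0]
    idle_rate = idle_delta / window_s  # idle seconds gained per wall-clock second
    return 100 - idle_rate * 100

# Counter gained 7.5 idle-seconds over a 30s window -> 25% idle -> 75% busy
print(cpu_busy_percent([100.0, 103.75, 107.5]))  # -> 75.0
```

The real expression averages this per instance across all CPU cores, but the arithmetic per core is exactly this.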
High Memory Usage
- alert: HighMemoryUsage
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "Memory usage is above 85% on {{ $labels.instance }} — current value: {{ $value | printf \"%.1f\" }}%"
Same pattern as CPU — sustained high memory usage for 5 minutes triggers
a warning.
Pipeline Failure
- alert: PipelineFailure
  expr: increase(drone_build_count{status="failure"}[30m]) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    description: "Drone CI pipeline failed in the last 30 minutes"
If any build fails in the last 30 minutes — get notified immediately.
No more wondering why the site didn't update after a push.
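`increase()` is essentially the counter's growth over the window. A rough sketch of the idea (real PromQL also extrapolates to the window edges, but the intuition is the same):

```python
def increase(counter_samples):
    """Total growth of a counter across a window, treating any drop as
    a counter reset -- a simplified model of PromQL increase()."""
    total = 0.0
    for prev, cur in zip(counter_samples, counter_samples[1:]):
        # On a reset the counter restarts from zero, so `cur` is all new growth
        total += cur - prev if cur >= prev else cur
    return total

# Hypothetical failure-count samples over a 30m window
print(increase([3, 3, 4, 4]))  # -> 1.0 : one failed build, alert fires
print(increase([5, 5, 5]))     # -> 0.0 : no new failures, stays quiet
```

Any growth in the failure counter within the window makes the expression positive, which is exactly the `> 0` condition in the rule.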
Step 4 — Adding Alertmanager to Docker Compose
I added Alertmanager as a new service in my monitoring docker-compose.yml:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - /etc/prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml
    ports:
      - "9090:9090"
    restart: always
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - /etc/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
    restart: always
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    restart: always

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: always
And updated prometheus.yml to tell Prometheus where Alertmanager is and
where the alert rules live:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/alert-rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'portfolio'
    static_configs:
      - targets: ['129.146.31.124:31889']
Step 5 — Testing It
Before trusting a system like this in production I needed to verify it
actually worked. I fired a test alert manually using Alertmanager's v2 API
(the older v1 endpoint was removed in recent Alertmanager releases):

curl -X POST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical"
    },
    "annotations": {
      "description": "This is a test alert from your homelab — alerting is working!"
    }
  }]'
Within 30 seconds a message appeared in #alerts on Slack:
🔴 ALERT: TestAlert
Severity: critical
Description: This is a test alert from your homelab — alerting is working!
Alerting confirmed working.
What I Now Get Notified About
| Alert | Condition | Severity | Fires After |
|---|---|---|---|
| Site Down | Prometheus can't reach the app | Critical | 1 minute |
| Pod Crash Looping | Pod restarting repeatedly | Critical | 5 minutes |
| Pod Not Running | Pod stuck in non-running state | Critical | 2 minutes |
| High CPU | CPU above 80% sustained | Warning | 5 minutes |
| High Memory | Memory above 85% sustained | Warning | 5 minutes |
| Pipeline Failure | Drone CI build failed | Warning | 1 minute |
What This Changed
Before alerting, my infrastructure was a black box. Things could break and
I wouldn't know until someone told me or I happened to check.
After alerting, the system tells me what's wrong before users notice. That
shift — from reactive to proactive — is one of the most important mindset
changes in DevOps.
The difference between a junior and senior DevOps engineer isn't just
knowing how to deploy things. It's knowing how to observe them, measure
them and get woken up when they break.
My Full Observability Stack
Metrics Collection → Prometheus (scrapes every 15s)
Visualization → Grafana (dashboards)
Alert Evaluation → Prometheus (checks rules every 15s)
Alert Routing → Alertmanager (deduplication, grouping)
Notification → Slack #alerts
What's Next
- PagerDuty integration for on-call rotation
- cert-manager for automatic SSL on Kubernetes
- ArgoCD for GitOps deployments
- Horizontal Pod Autoscaler to scale pods based on CPU
One thing at a time. 🚀
Everything documented here is running live on my homelab right now.
Follow the journey on Twitter or connect on LinkedIn.
Omobayonle Ogundele
DevOps Engineer based in Lagos, Nigeria. Building reliable infrastructure and sharing logs from the edge of production.