Building Production-Grade DevOps Infrastructure: A Complete Journey
Omobayonle Ogundele
A detailed account of building a complete DevOps platform from scratch over three weeks, including every problem encountered and solution implemented.
Table of Contents
- The Beginning: Project Objectives
- Week 1: Building the Foundation
- Week 2: The CI/CD Pipeline From Hell
- Week 3: Going to Production
- The Complete Problem Log
- What I Actually Learned
- The Final Result
The Beginning: Project Objectives {#the-beginning}
Three weeks ago, I began a project to build a complete DevOps infrastructure platform from scratch. The goal was to move beyond theoretical knowledge from tutorials and documentation to hands-on implementation of production-grade systems.
Starting Resources:
- HP Stream laptop (4GB RAM) - initial testing environment
- HP EliteBook 840 G2 (8GB RAM) - main development machine
- Foundational knowledge of Docker and containerization
- Ubuntu 24.04 as the base operating system
Project Objectives:
The goal was to build enterprise-grade infrastructure components:
- Self-hosted Git server for version control
- Automated CI/CD pipeline for continuous integration and deployment
- Private container registry for image management
- Centralized secrets management
- Comprehensive monitoring and logging infrastructure
- Production deployment to cloud infrastructure
- Complete automation from code commit to production
This project aimed to demonstrate the ability to design, implement, and troubleshoot complex infrastructure systems.
Day 0: The Hardware Wake-Up Call
I started by spinning up Prometheus, Grafana, and Gitea on my HP Stream laptop.
The laptop immediately became a bottleneck.
Containers took 2-3 minutes to start. The system was constantly swapping to disk. Running multiple services simultaneously brought the machine to a crawl. I'd run `docker-compose up` and wait several minutes for all services to initialize.
This was my first lesson: hardware matters.
Fortunately, I had an HP EliteBook 840 G2 with 8GB RAM that I use for my main work. I decided to use it for the homelab project as well. The difference was immediate and dramatic.
Containers started in seconds. I could run 10+ services simultaneously without performance issues. The development workflow became significantly smoother.
Lesson: Adequate hardware is essential for running containerized infrastructure. 8GB RAM should be considered the minimum for multi-container development environments.
Week 1: Building the Foundation {#week-1-foundation}
Day 1: The Monitoring Stack
I started with what every good DevOps engineer starts with: observability.
You can't fix what you can't see. So before building anything else, I needed to be able to monitor it.
The plan:
- Prometheus for metrics collection
- Grafana for visualization
- Node Exporter for system metrics
I created my project structure:
```
homelab/
├── monitoring/
│   ├── prometheus/
│   │   └── prometheus.yml
│   ├── grafana/
│   │   └── provisioning/
│   └── docker-compose.yml
```
The Prometheus config was straightforward:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```
Ran `docker-compose up -d`.
It worked. First try.
I should have known it wouldn't last.
Day 2: The Git Server (First Real Problem)
Gitea is a self-hosted Git service. Think GitHub, but running on your own infrastructure.
I spun it up with PostgreSQL as the database:
```yaml
services:
  gitea-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_PASSWORD: gitea123

  gitea:
    image: gitea/gitea:latest
    environment:
      GITEA__database__PASSWD: gitea123
```
Started it. Gitea loaded. Perfect.
Then I made a change to the configuration and restarted everything.
Error:
```
pq: password authentication failed for user "gitea"
```
Wait, what? I didn't change the password.
I spent 2 hours trying different things:
- Checked the compose file ✓ (passwords matched)
- Restarted containers ✗ (same error)
- Checked logs ✗ (database running fine)
- Googled the error ✗ (generic solutions didn't help)
Finally, it hit me. Docker volumes persist data.
The first time I ran Gitea, Docker created a volume with the database. When I changed the password in the compose file and restarted, the new password was in the environment variables, but the old password was still in the PostgreSQL database stored in the volume.
The fix:
```shell
docker-compose down -v   # the -v removes volumes
docker-compose up -d
```
It worked immediately.
Lesson learned: Docker volumes are persistent storage. If you need a fresh start, you need to explicitly remove them.
Time wasted: 2 hours
Value of lesson: Saved me countless hours in the future
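The mechanism is worth internalizing: the official postgres image only applies `POSTGRES_PASSWORD` when it initializes an empty data directory, so later environment changes never touch the persisted volume. A toy Python model of that behavior (nothing Docker-specific, just the initialization-once pattern):

```python
# Toy model: a "volume" keeps whatever password it was first initialized with.
# Changing the environment on a restart does nothing; only wiping the volume
# (docker-compose down -v) lets the new environment value take effect.
class Volume:
    def __init__(self):
        self.password = None

    def init_once(self, env_password):
        # postgres-style: the env var only applies on first initialization
        if self.password is None:
            self.password = env_password


vol = Volume()
vol.init_once("gitea123")   # first `docker-compose up`
vol.init_once("newpass")    # restart with a changed compose file: ignored
print(vol.password)         # still "gitea123" -> auth mismatch with the new env

vol = Volume()              # `down -v` wipes the volume; fresh initialization
vol.init_once("newpass")
print(vol.password)         # "newpass"
```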
Day 3: The Nested Repository Nightmare
I created a repository in Gitea for my sample application. Cloned it to my laptop. Started adding code.
Then I tried to commit everything to my main homelab repository:
```shell
git add .
git commit -m "Add sample app"
```
Error:
```
warning: adding embedded git repository: homelab/apps/sample-app
hint: You've added another git repository inside your git repository.
```
Oh no.
I'd cloned the sample-app repo instead of just copying the files. So now I had a Git repo inside a Git repo. Git doesn't like that.
This one was quick to fix:
```shell
cd homelab/apps/sample-app
rm -rf .git
cd ~/homelab
git add .
git commit -m "Add sample app files"
```
Lesson learned: Don't clone repos into other repos. Copy files or use Git submodules.
Time wasted: 30 minutes
Frustration level: Moderate
Days 4-7: Monitoring Everything
The rest of Week 1 was smoother. I:
- Set up Grafana dashboards
- Imported the Node Exporter dashboard (ID: 1860)
- Configured Prometheus to scrape all my services
- Added Loki for centralized logging
- Set up Promtail to ship logs
By the end of Week 1, I had complete observability. I could see CPU usage, memory, disk I/O, network traffic, and logs from all containers.
It felt good. Everything was working.
I thought I was ready for CI/CD.
I was so, so wrong.
Week 2: The CI/CD Pipeline From Hell {#week-2-cicd}
Week 2 nearly broke me. This is where I learned that DevOps is 90% debugging and 10% configuration.
Day 8: Setting Up Drone CI
Drone is a container-native CI/CD platform. Every build runs in a fresh Docker container. It integrates with Gitea via OAuth.
First, I needed to create an OAuth application in Gitea:
- Settings → Applications → Create OAuth2 Application
- Name: Drone CI
- Redirect URI: http://localhost:8080/login
Gitea gave me a Client ID and Client Secret. I generated an RPC secret:
```shell
openssl rand -hex 16
```
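If `openssl` isn't handy, Python's standard library produces an equivalent secret (16 random bytes rendered as 32 hex characters):

```python
# Equivalent of `openssl rand -hex 16`: 16 cryptographically random bytes,
# encoded as 32 lowercase hex characters.
import secrets

rpc_secret = secrets.token_hex(16)
print(rpc_secret)  # 32 lowercase hex characters
```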
Then I configured Drone:
```yaml
services:
  drone-server:
    image: drone/drone:2
    environment:
      DRONE_GITEA_SERVER: http://localhost:3001
      DRONE_GITEA_CLIENT_ID: [client-id]
      DRONE_GITEA_CLIENT_SECRET: [client-secret]
      DRONE_RPC_SECRET: [rpc-secret]
```
Started it. Opened http://localhost:8080.
OAuth redirect worked. I authorized with Gitea. Logged in.
Success!
Or so I thought.
Day 9: The localhost That Wasn't localhost
I activated my repository in Drone. Created a simple .drone.yml:
```yaml
kind: pipeline
type: docker
name: default

steps:
  - name: test
    image: node:18-alpine
    commands:
      - npm install
      - npm test
```
Committed it. Pushed to Gitea.
The build triggered! Progress!
Then it failed:
```
fatal: unable to access 'http://localhost:3001/bayo/sample-app.git/':
Failed to connect to localhost port 3001: Connection refused
```
But... Gitea was running on localhost:3001. I could access it in my browser. I could clone the repo from my terminal. What was going on?
I spent the next 3 hours in a debugging spiral.
Attempt 1: Maybe it needs host.docker.internal instead of localhost?
```yaml
DRONE_GITEA_SERVER: http://host.docker.internal:3001
```
Result:
```
ERR_NAME_NOT_RESOLVED: host.docker.internal's server IP could not be found
```
Stack Overflow said to use host.docker.internal. Every tutorial used it. Why wasn't it working?
Because I'm on Linux. host.docker.internal is a Docker Desktop (Mac/Windows) feature. It doesn't exist on Linux.
Attempt 2: Try 127.0.0.1 instead?
```yaml
DRONE_GITEA_SERVER: http://127.0.0.1:3001
```
Result:
```
Failed to connect to 127.0.0.1 port 3001: Connection refused
```
Same error. What the hell is happening?
Attempt 3: Disable the firewall entirely?
```shell
sudo ufw disable
```
Result: Still connection refused.
At this point I was frustrated. I'd been at this for 2.5 hours. I decided to do something I should have done from the beginning:
I actually read the Docker networking documentation.
Buried in the docs, I found this sentence:
"From the perspective of a container, localhost refers to the container itself, not the host machine."
Wait.
WAIT.
When the Drone container tries to connect to localhost:3001, it's looking inside itself for a Git server. Of course it can't find it. Gitea is running on my host machine, not inside the Drone container!
The solution: use the Docker bridge IP.
```shell
ip addr show docker0
```
```
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
```
There it was. 172.17.0.1 - the IP address of the host machine as seen from inside containers.
```yaml
DRONE_GITEA_SERVER: http://172.17.0.1:3001
```
Restarted Drone. Tried again.
It worked.
After 3 hours, the fix was changing one IP address.
Lesson learned: Containers are isolated. localhost inside a container means the container itself. Use the bridge IP (172.17.0.1) to reach the host from containers.
Time wasted: 3 hours
Knowledge gained: Fundamental understanding of Docker networking
Would I trade it: Absolutely not. This lesson was invaluable.
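Hard-coding 172.17.0.1 works on the default bridge, but there's a more general trick: from inside a container on the default bridge, the host is the container's default gateway. A hedged sketch that reads it from `/proc/net/route` (Linux only; little-endian hex field layout assumed):

```python
# Sketch: find the host's IP as seen from inside a container by reading the
# default route. On the default Docker bridge this is typically 172.17.0.1.
# Assumes Linux and /proc/net/route's little-endian hex encoding.
import socket
import struct


def default_gateway(route_file="/proc/net/route"):
    with open(route_file) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            if fields[1] == "00000000":  # destination 0.0.0.0 -> default route
                # gateway is a little-endian hex-encoded IPv4 address
                return socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
    return None
```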
Day 10: YAML, My Greatest Enemy
With Drone connecting to Gitea, I started writing my actual build pipeline.
I wanted:
1. Clone the repository
2. Run tests
3. Build a Docker image
4. Push to Harbor (my private registry)
Here's what I wrote:
```yaml
kind: pipeline
type: docker
name: default

steps:
  - name: test
    image: node:18-alpine
    commands:
      - npm install
      - npm test

  - name: build
    image: plugins/docker
    settings:
      registry: 172.16.18.128:8888
      repo: 172.16.18.128:8888/library/sample-app
      tags: latest
      username: admin
      password:
        from_secret: harbor_password
```
Committed. Pushed. Watched the build start.
Error:
```
yaml: unmarshal errors:
  line 31: cannot unmarshal !!map into string
```
Line 31. Let me check line 31. That's... tags: latest.
What could possibly be wrong with tags: latest? It's just a string!
I added quotes: tags: "latest"
Same error.
I checked my indentation. Everything was aligned with 2 spaces. Consistent throughout.
I ran yamllint:
```shell
yamllint .drone.yml
```
It complained about line length but said nothing about structure.
I compared with examples in the Drone docs. My file looked identical.
One hour passed.
I was starting to lose my mind. This was a simple YAML file. What was I missing?
I decided to rewrite the entire file from scratch. Opened a new file. Started typing from memory.
When I got to the Docker build step, I wrote:
```yaml
settings:
  tags:
    - latest
```
Wait. Why did I write it like that?
I looked at my broken file:
```yaml
settings:
  tags: latest
```
And then I looked at the Drone docs more carefully:
```yaml
settings:
  tags:
    - latest
    - ${DRONE_COMMIT_SHA}
```
Oh.
Oh no.
The Docker plugin expects tags to be a list, not a string.
In YAML:
- `tags: latest` means "tags is a string with value 'latest'"
- `tags:` followed by an indented `- latest` means "tags is a list containing the one string 'latest'"
The plugin was trying to unmarshal a string where it expected a list. That's what the error meant.
The fix:
```yaml
tags:
  - latest
  - build-${DRONE_BUILD_NUMBER}
```
Saved. Committed. Pushed. Held my breath.
The build succeeded.
Three. Hours. One. Space.
Actually, one dash. But you get the point.
Lesson learned: YAML is whitespace-sensitive and data-type-sensitive. A string is not a list. A list is not a map. There's no "close enough" in YAML. It either matches the expected structure exactly, or it fails.
Time wasted: 3+ hours
New fear unlocked: YAML
Fun fact: I now validate all YAML with `yamllint` and `python -c "import yaml; yaml.safe_load(open('file.yml'))"` before deploying
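The string-vs-list distinction is easy to verify before pushing. A quick check in Python (assuming PyYAML, `pip install pyyaml`, is installed):

```python
# Show why `tags: latest` and a dashed list parse to different Python types.
# Assumes PyYAML is available.
import yaml

scalar = yaml.safe_load("tags: latest")
listed = yaml.safe_load("tags:\n  - latest")

print(type(scalar["tags"]))  # -> <class 'str'>
print(type(listed["tags"]))  # -> <class 'list'>
```

A plugin that expects a list will fail to unmarshal the first form, which is exactly what the Drone error was saying.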
Day 11-12: Harbor Registry
With Drone working, I needed somewhere to push Docker images. Enter Harbor - an open-source container registry with vulnerability scanning.
I downloaded the installer:
```shell
wget https://github.com/goharbor/harbor/releases/download/v2.10.0/harbor-online-installer-v2.10.0.tgz
tar xzvf harbor-online-installer-v2.10.0.tgz
cd harbor
```
Configured harbor.yml:
- hostname: 172.16.18.128
- http port: 8888
- admin password: Harbor12345
Ran the installer:
```shell
sudo ./install.sh
```
Ten minutes later, Harbor was running. 10 containers, all humming along. Trivy scanner included for vulnerability detection.
I created a project called library and configured Drone to push there.
Updated my pipeline:
```yaml
- name: build
  image: plugins/docker
  settings:
    registry: 172.16.18.128:8888
    repo: 172.16.18.128:8888/library/sample-app
    tags:
      - latest
      - build-${DRONE_BUILD_NUMBER}
    username: admin
    password:
      from_secret: harbor_password
    insecure: true  # self-signed cert
```
Pushed code. Watched the build. Held my breath.
The image built. The push succeeded. I checked Harbor.
There it was. My first automated Docker build, sitting in my private registry.
This felt amazing.
Day 13-14: The Webhook Mystery
Everything was working... when I manually triggered builds in Drone.
But when I pushed code to Gitea, nothing happened.
The webhook existed in Gitea settings. Test delivery showed "200 OK". But real pushes triggered nothing.
I checked Drone logs. No webhook received.
I checked the webhook URL:
http://localhost:8080/hook
Localhost again. Of course.
Changed it to:
http://172.17.0.1:8080/hook
Pushed code. Build triggered!
But wait. It still failed. Different error now:
```
Failed to clone repository. Branch 'main' not found.
```
My code was on the master branch. My pipeline was watching main.
Modern Git uses main as the default. Older repos use master. My Gitea installation defaulted to master.
Updated the pipeline:
```yaml
when:
  branch:
    - main
    - master
```
Pushed again.
Success! The build triggered automatically and completed.
Lesson learned: Webhooks need exact URLs (no localhost), and branch naming matters. Test thoroughly.
Time wasted: 2 hours
Patience remaining: Running low
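A tiny pre-flight check like this (a hypothetical helper, not part of Drone or Gitea) would have caught both misconfigurations before any push:

```python
# Flag the two webhook mistakes from this chapter: a localhost URL (which a
# container resolves to itself, not the host) and a pipeline whose branch
# filter never matches the repo's actual default branch.
from urllib.parse import urlparse


def webhook_warnings(url, pipeline_branches, repo_default_branch):
    warnings = []
    host = urlparse(url).hostname
    if host in ("localhost", "127.0.0.1"):
        warnings.append("webhook URL uses localhost; use the bridge IP instead")
    if repo_default_branch not in pipeline_branches:
        warnings.append(
            f"pipeline never triggers for default branch {repo_default_branch!r}"
        )
    return warnings


print(webhook_warnings("http://localhost:8080/hook", ["main"], "master"))
# both problems from Day 13-14 are reported
```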
Week 3: Going to Production {#week-3-production}
Day 15-16: Setting Up Oracle Cloud
I had everything working on my homelab. Now I needed to deploy to an actual cloud server.
I chose Oracle Cloud's free tier:
- VM instance (AMD shape)
- 1 OCPU, 6GB RAM
- Ubuntu 24.04
- Free forever
I used Ansible to set it up:
```yaml
---
- name: Setup Docker on Oracle Cloud
  hosts: oracle
  become: yes
  tasks:
    - name: Install Docker
      apt:
        name:
          - docker.io
          - docker-compose
        state: present
        update_cache: yes

    - name: Configure insecure registries
      copy:
        content: |
          {
            "insecure-registries": ["172.16.18.128:8888"]
          }
        dest: /etc/docker/daemon.json
```
Then I set up a WireGuard VPN tunnel between my homelab and Oracle Cloud:
Homelab (10.0.0.1):
```ini
[Interface]
PrivateKey = [homelab-private-key]
Address = 10.0.0.1/24
ListenPort = 51820

[Peer]
PublicKey = [oracle-public-key]
AllowedIPs = 10.0.0.2/32
```
Oracle Cloud (10.0.0.2):
```ini
[Interface]
PrivateKey = [oracle-private-key]
Address = 10.0.0.2/24

[Peer]
PublicKey = [homelab-public-key]
AllowedIPs = 10.0.0.1/32
Endpoint = [my-home-ip]:51820
```
Started WireGuard on both sides:
```shell
sudo wg-quick up wg0
```
Tested:
```shell
ping 10.0.0.2   # from homelab
ping 10.0.0.1   # from cloud
```
Both worked! Secure tunnel established.
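One easy WireGuard mistake is pasting a truncated or mangled key into a config. Keys are 32 random bytes, base64-encoded to 44 characters, so a quick sanity check is possible (hypothetical helper, not part of WireGuard's tooling):

```python
# Sanity-check a WireGuard key string before pasting it into a config:
# valid keys are 32 bytes, base64-encoded.
import base64


def looks_like_wg_key(key: str) -> bool:
    try:
        return len(base64.b64decode(key, validate=True)) == 32
    except ValueError:  # bad characters or bad padding
        return False


print(looks_like_wg_key(base64.b64encode(bytes(32)).decode()))  # True
print(looks_like_wg_key("not-a-key"))                           # False
```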
Day 17: The Clone Problem (Again)
I added a deploy step to my Drone pipeline:
```yaml
- name: deploy
  image: appleboy/drone-ssh
  settings:
    host: 10.0.0.2
    username: ubuntu
    key:
      from_secret: oracle_ssh_key
    script:
      - docker pull 10.0.0.1:8888/library/sample-app:latest
      - docker stop sample-app || true
      - docker rm sample-app || true
      - docker run -d --name sample-app -p 5000:3000 10.0.0.1:8888/library/sample-app:latest
```
But the default clone step was still using localhost. And I'd learned that using localhost in containers doesn't work.
Solution: disable the default clone and write my own:
```yaml
clone:
  disable: true

steps:
  - name: clone
    image: alpine/git
    commands:
      - git clone http://172.17.0.1:3001/bayo/sample-app.git .
      - git checkout $DRONE_COMMIT
```
Pushed code. The pipeline ran. The deploy step executed.
I checked the Oracle Cloud server:
```shell
docker ps
```
There it was. My application. Running. In the cloud. Deployed automatically.
I opened my browser: http://[oracle-ip]:5000
It loaded.
My first automated deployment to the cloud.
Day 18: Building the Portfolio Website
With the pipeline working, I built my actual portfolio website:
- Flask backend
- SQLAlchemy ORM
- PostgreSQL database
- Jinja2 templates
- TailwindCSS for styling
Created a proper Dockerfile:
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "run.py"]
```
Note that CMD. This caused me an hour of debugging later.
Day 19: The Container Crash Loop
Deployed the portfolio website. The pipeline succeeded. Image pushed to Harbor. Deployment to cloud executed.
I checked the logs:
```shell
docker logs portfolio
```
```
Failed to find attribute 'app' in 'app'.
Worker (pid:7) exited with code 4
App failed to load.
```
The container kept crashing and restarting.
I'd initially written:
```dockerfile
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
```
Gunicorn was looking for:
- A file named app.py
- With a variable named app
My project structure:
```
portfolio/
├── app/
│   ├── __init__.py
│   └── routes/
├── run.py
└── Dockerfile
```
No app.py file. The entry point was run.py.
Changed the Dockerfile:
```dockerfile
CMD ["python", "run.py"]
```
Pushed. Deployed. Checked logs.
It worked.
Lesson learned: Container entry points must match actual code structure. Check what files you actually have!
Time wasted: 1 hour
Obvious in hindsight: Extremely
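What Gunicorn does with `app:app` is roughly "import the module before the colon, fetch the attribute after it." A simplified model of that lookup (not Gunicorn's actual code):

```python
# Rough model of how a WSGI server resolves a "module:attribute" spec.
# "app:app" failed in my container for exactly this reason: there was no
# importable `app` module exposing an `app` attribute, only run.py.
import importlib


def resolve_wsgi(spec):
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


print(resolve_wsgi("json:dumps"))  # resolves to the stdlib json.dumps function
```

With my layout, the correct Gunicorn spec would have pointed at the actual package, but switching the CMD to `python run.py` was the simpler fix.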
Day 20: The Firewall Surprise
Container running. Logs clean. Everything perfect.
Opened browser: http://[oracle-ip]
This site can't be reached
Connection refused
But the container was running! I could curl it from inside the VM!
```shell
# From the Oracle Cloud VM
curl http://localhost:80   # works!

# From my laptop
curl http://[oracle-ip]    # connection refused
```
Firewall.
Oracle Cloud blocks ALL traffic by default. I needed to open port 80 in two places:
1. VM iptables:
```shell
sudo iptables -I INPUT 6 -m state --state NEW -p tcp --dport 80 -j ACCEPT
sudo netfilter-persistent save
```
2. Oracle Cloud Security List:
- Compute → Instances → Subnet → Security List
- Add Ingress Rule:
- Source: 0.0.0.0/0
- Protocol: TCP
- Port: 80
Saved the rule. Waited 30 seconds.
Tried again from my laptop:
curl http://[oracle-ip]
It worked!
Opened browser. My portfolio website loaded. From the internet. Deployed automatically via my CI/CD pipeline.
This was the moment everything clicked.
Day 21: Welcome to the Internet
I was proud. My website was live. I'd built the entire infrastructure myself.
I decided to check the logs to make sure everything was running smoothly:
```shell
docker logs portfolio --tail 50
```
What I saw made my stomach drop:
```
89.248.168.239 - - [04/Mar/2026 21:20:35] "POST /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
89.248.168.239 - - [04/Mar/2026 21:20:36] "POST /public/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
89.248.168.239 - - [04/Mar/2026 21:20:37] "POST /web/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
89.248.168.239 - - [04/Mar/2026 21:20:38] "POST /site/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
[hundreds more lines...]
```
My website had been live for 5 minutes and was already under attack.
Automated scanners were probing for:
- PHPUnit vulnerabilities
- WordPress exploits
- SQL injection points
- Path traversal vulnerabilities
- Directory listing
- Old framework versions
All of them returned 404. My Flask app doesn't have any of those paths. I wasn't vulnerable.
But that was luck, not planning.
I'd deployed to production without:
- Firewall rules (beyond port opening)
- Rate limiting
- Fail2Ban
- HTTPS
- Security headers
- Any real hardening
The attacks taught me something crucial: the internet is hostile by default.
Every public IP address is being scanned constantly. Automated bots probe for any weakness. The question isn't if you'll be attacked. It's when, and whether you'll be vulnerable when it happens.
I immediately:
1. Enabled UFW:
```shell
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable
```
2. Installed Fail2Ban:
```shell
sudo apt install fail2ban
sudo systemctl enable fail2ban
```
3. Added monitoring alerts:
```yaml
- alert: HighErrorRate
  expr: rate(http_requests_total{status="404"}[5m]) > 100
  annotations:
    summary: "High rate of 404 errors - possible attack"
```
4. Planned for HTTPS:
- Get domain name
- Configure Let's Encrypt
- Add SSL certificate
- Enable HSTS
Lesson learned: Security is not optional. It's not something you add later. The moment your server is public, the attacks start. Be ready from day one.
Time to first attack: 5 minutes
Number of attack attempts in first hour: 500+
Vulnerability exploitation success rate: 0% (but only by luck)
The Complete Problem Log {#complete-problems}
Here's every problem I faced, in chronological order:
Problem 1: Hardware Constraints
- Issue: 4GB RAM insufficient for multiple containers
- Symptom: Extreme slowness, constant swapping, 2-3 minute container startup times
- Solution: Switched to HP EliteBook 840 G2 with 8GB RAM (main work laptop)
- Time: Initial testing showed the issue, migration completed in under an hour
- Lesson: Adequate hardware is essential for containerized development; 8GB RAM minimum recommended
Problem 2: Database Password Mismatch
- Error: `pq: password authentication failed`
- Cause: Docker volumes persisted the old password
- Solution: `docker-compose down -v`
- Time: 2 hours
- Lesson: Volumes persist data across restarts
Problem 3: Nested Git Repositories
- Error: `adding embedded git repository`
- Cause: Cloned repo had its own .git folder
- Solution: `rm -rf .git` in nested directory
- Time: 30 minutes
- Lesson: Don't nest git repos
Problem 4: Docker Networking - localhost
- Error: `Connection refused` to localhost:3001
- Cause: localhost inside container ≠ host machine
- Solution: Used Docker bridge IP (172.17.0.1)
- Time: 3 hours
- Lesson: Containers are isolated
Problem 5: YAML Unmarshal Errors
- Error: `cannot unmarshal !!map into string`
- Cause: `tags: latest` should be list format
- Solution: `tags:` with `- latest` on the next line
- Time: 3+ hours
- Lesson: YAML data types are strict
Problem 6: Webhook Not Triggering
- Issue: Builds didn't start on git push
- Causes: localhost URL, branch name mismatch
- Solution: Bridge IP, watch both main and master
- Time: 2 hours
- Lesson: Test webhooks thoroughly
Problem 7: Drone Runner Connection
- Error: `http: no content returned`
- Cause: Runner using localhost instead of service name
- Solution: `DRONE_RPC_HOST: drone-server`
- Time: 1 hour
- Lesson: Use service names for container communication
Problem 8: Git Clone in Pipeline
- Error: `Connection refused` during clone
- Cause: Default clone step using localhost
- Solution: Custom clone step with bridge IP
- Time: 1.5 hours
- Lesson: Override defaults when needed
Problem 9: CI Authentication
- Error: `terminal prompts disabled`
- Cause: CI can't do interactive auth
- Solution: Made repo public / used tokens
- Time: 1 hour
- Lesson: CI/CD needs non-interactive auth
Problem 10: Container Crash Loop
- Error: `Failed to find attribute 'app'`
- Cause: CMD didn't match code structure
- Solution: Changed to `python run.py`
- Time: 1 hour
- Lesson: Entry point must match actual files
Problem 11: Cloud Firewall
- Issue: Can't access from internet
- Cause: Oracle Cloud blocks all traffic by default
- Solution: Opened port 80 in security list + iptables
- Time: 1.5 hours
- Lesson: Cloud firewalls are locked down
Bonus: Immediate Attacks
- Discovery: Attacks started within 5 minutes
- Types: PHPUnit, WordPress, SQL injection, path traversal
- Result: All blocked (404)
- Action: Added firewall, Fail2Ban, monitoring
- Lesson: Internet is hostile, security is mandatory
Total Debugging Time: ~18 hours (37% of project)
Problems Solved: 11/11 (100% success rate)
Most Valuable Lessons: Docker networking, YAML syntax, security awareness
What I Actually Learned {#lessons-learned}
Technical Skills
Docker & Containers:
- Container networking (bridge IPs, service names)
- Volume management and persistence
- Multi-container orchestration
- Image building and optimization
- Registry management
CI/CD:
- Pipeline design and debugging
- Webhook configuration
- Secret management
- Automated deployments
- Zero-downtime updates
Infrastructure:
- Linux server administration
- Firewall configuration
- VPN tunnels (WireGuard)
- Cloud platform basics
- Monitoring and logging
Configuration Management:
- YAML syntax (the hard way)
- Docker Compose
- Environment variables
- Configuration as code
Security:
- Firewall rules (iptables, UFW)
- Fail2Ban setup
- Attack pattern recognition
- Security-first thinking
Soft Skills
Problem Solving:
- Systematic debugging approach
- Reading error messages carefully
- Testing one variable at a time
- Knowing when to start over
Persistence:
- 3-hour debug sessions without giving up
- Trying multiple approaches
- Learning from each failure
- Celebrating small wins
Documentation:
- Writing down solutions immediately
- Creating reference guides
- Explaining complex topics simply
- Sharing knowledge with others
Learning:
- Reading official documentation first
- Understanding fundamentals before shortcuts
- Building mental models
- Connecting concepts across tools
Philosophy Changes
Before: "I need to watch more tutorials before I start."
After: "The only way to learn is by building and breaking things."
Before: "This error is impossible to debug."
After: "This error has a logical cause. I just need to find it."
Before: "DevOps is about knowing all the tools."
After: "DevOps is about understanding how systems work and debugging when they don't."
Before: "I'll add security later when I have time."
After: "Security is built-in from day one, not bolted on later."
The Final Result {#final-result}
What I Built
Infrastructure Components:
- Prometheus (metrics collection)
- Grafana (visualization and dashboards)
- Node Exporter (system metrics)
- Gitea (self-hosted Git server)
- PostgreSQL (database for Gitea)
- Drone CI Server (CI/CD orchestration)
- Drone Runner (build execution)
- Harbor (container registry with 10 internal containers)
- Vault (secrets management)
- Loki (log aggregation)
- Promtail (log shipping)
- WireGuard (VPN tunnel)
- Portfolio Website (Flask application)
Total Containers: 18+
Total Services: 13
Lines of YAML: 500+
Configuration Files: 20+
Workflow
The complete automated workflow:
```
Developer (me)
    ↓
git commit -m "New feature"
git push origin main
    ↓
Gitea webhook triggers Drone CI
    ↓
Drone Runner starts build
    ↓
1. Clone code (using bridge IP)
2. Run tests (npm test)
3. Build Docker image
4. Push to Harbor registry
    ↓
SSH to Oracle Cloud via WireGuard VPN
    ↓
1. Pull image from Harbor (via 10.0.0.1:8888)
2. Stop old container
3. Remove old container
4. Start new container
5. Health check verification
    ↓
Logs shipped to Loki
Metrics sent to Prometheus
Alerts configured in Grafana
    ↓
Website live at http://[oracle-ip]
```
Total time: 3 minutes
Manual steps: 0
Statistics
Time Investment:
- Week 1 (Monitoring): 8 hours
- Week 2 (CI/CD): 12 hours
- Week 3 (Cloud): 10 hours
- Debugging: 15 hours
- Documentation: 3 hours
- Total: ~48 hours over 3 weeks
Problems Encountered: 11 major issues
Solutions Found: 11 (100% success rate)
Tutorials Watched: 5
Documentation Pages Read: 50+
Stack Overflow Visits: Too many to count
"It works!" Moments: Priceless
Current Status:
- Infrastructure: Running smoothly
- Deployments: Fully automated
- Monitoring: Complete visibility
- Security: Hardened and monitored
- Uptime: 2+ weeks
- Attack attempts blocked: 500+
Access & Links
Infrastructure Services:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Gitea: http://localhost:3001
- Drone CI: http://localhost:8080
- Harbor: http://localhost
Production:
- Portfolio Website: http://[oracle-ip]
- Status: Live and deployed
- Deploy time: 3 minutes from git push
- Automation: 100%
What's Next
Immediate Improvements (This Week)
- SSL Certificate
  - Get domain name
  - Configure Let's Encrypt
  - Enable HTTPS everywhere
- Better Monitoring
  - Application-level metrics
  - Custom Grafana dashboards
  - Alert routing to Slack/Email
- Security Hardening
  - Rate limiting in Nginx
  - Security headers (HSTS, CSP)
  - Regular security audits
Short-Term Projects (Next Month)
- Transport Wallet API
  - Python FastAPI backend
  - PostgreSQL database
  - Redis caching
  - Deploy using existing pipeline
- AWS Multi-Tier Application
  - Learn Terraform
  - Deploy to AWS
  - Compare self-hosted vs cloud
- Kubernetes Migration
  - Set up K3s cluster
  - Deploy apps to K8s
  - Learn container orchestration
Long-Term Goals (Next 3 Months)
- Advanced Monitoring
  - ELK stack
  - Distributed tracing
  - APM tools
- GitOps Implementation
  - ArgoCD
  - Declarative configuration
  - Git as source of truth
- Multi-Cloud Strategy
  - Deploy to AWS, GCP, Azure
  - Compare platforms
  - Learn cloud-agnostic design
Advice for Anyone Starting This Journey
Do This:
1. Start with a real project
Don't just follow tutorials. Build something you'll actually use. A portfolio site, a blog, a tool you need. Real projects have real problems, and solving real problems is how you learn.
2. Document everything
Write down every problem you face and how you solve it. Your future self will thank you. So will others who hit the same issues.
3. Embrace the debugging
You'll spend 30-40% of your time debugging. This is normal. This is where the learning happens. The tutorials skip this part, but this is the most valuable part.
4. Read the official docs
Stack Overflow is helpful, but the official documentation is where you build real understanding. Read it. Even when it's boring.
5. Build observability first
Set up monitoring and logging before you build anything else. You can't debug what you can't see.
6. Version control everything
Your infrastructure configs, your deployment scripts, your documentation. Put it all in Git. You'll need it when things break at 2am.
Don't Do This:
1. Wait until you "feel ready"
You'll never feel ready. Start building. You'll learn along the way.
2. Skip the fundamentals
Don't jump straight to Kubernetes without understanding Docker. Don't use Terraform without understanding what it's automating. Learn the basics first.
3. Copy-paste without understanding
That Stack Overflow answer might work, but if you don't understand why, you'll be stuck when it breaks in a different way.
4. Ignore security
Don't deploy to production without basic security. Firewall, fail2ban, HTTPS. The attacks start immediately. Be ready.
5. Give up on long debugging sessions
That 3-hour YAML debugging session? That's where I learned YAML deeply. Persistence pays off.
6. Deploy without monitoring
If you can't see what's happening, you can't fix it when it breaks. And it will break.
The Truth About DevOps
DevOps tutorials make it look clean:
- Write a config file
- Run a command
- Everything works
- Celebrate
Reality is messier:
- Write a config file
- Run a command
- Error
- Debug for 2 hours
- Google the error
- Try 5 different solutions
- None work
- Read the documentation
- Find the actual problem
- Fix it
- Run the command
- Different error
- Repeat
The gap between tutorials and reality is where the learning happens.
Nobody talks about:
- The 3-hour YAML debugging sessions
- The networking concepts you don't understand
- The errors that make no sense
- The moments you want to give up
- The small victories that keep you going
But this is the job. DevOps is 10% configuration and 90% debugging. If you can systematically debug production issues, you're valuable. If you can build working infrastructure from scratch, you're hireable.
Final Thoughts
Three weeks ago, I had a 4GB laptop and little more than foundational Docker knowledge.
Today, I have:
- Production infrastructure running in the cloud
- Automated deployments (git push → live in 3 minutes)
- Complete monitoring and logging
- 11 problems solved and documented
- Deep understanding of Docker, CI/CD, and networking
- Proof that I can build and debug real systems
The infrastructure is running. The automation works. The security is hardened. The monitoring is comprehensive.
This isn't a tutorial project. This is production infrastructure that I built, debugged, and deployed myself.
And if I can do it in 3 weeks with a used laptop and an internet connection, anyone can.
The question isn't whether you're ready. The question is whether you're willing to start.
Resources & Links
My Infrastructure:
- GitHub: [link to your repo]
- Live Site: http://[your-site]
- Architecture Diagrams: [link]
Tools Used:
- Docker & Docker Compose
- Gitea
- Drone CI
- Harbor
- HashiCorp Vault
- Prometheus & Grafana
- Loki & Promtail
- WireGuard
- Oracle Cloud
Helpful Documentation:
- Docker Networking: https://docs.docker.com/network/
- Drone CI: https://docs.drone.io/
- Harbor: https://goharbor.io/docs/
- Prometheus: https://prometheus.io/docs/
My Other Posts:
- Docker Networking: Why localhost Isn't localhost
- The YAML Nightmare: 3 Hours Debugging One Space
- Attacked in 5 Minutes: Security Lessons
Comments & Discussion
Did you build something similar? Hit the same problems? Have questions about my setup?
Drop a comment below. I read and respond to every one.
If this helped you, share it with others who are learning DevOps. The more people who understand that the struggle is normal, the more people will push through and succeed.
Now go build something. Break something. Debug it. Learn from it.
That's how you become a DevOps engineer.
Written by Bayo
Building in public | Documenting everything | Learning DevOps the hard way
Follow my journey:
GitHub • LinkedIn • Twitter
Published: March 8, 2026
Last Updated: March 8, 2026
Reading Time: 25 minutes
Word Count: ~8,500 words
If you made it this far, you're serious about learning DevOps. Welcome to the journey.
Omobayonle Ogundele
DevOps Engineer based in Lagos, Nigeria. Building reliable infrastructure and sharing logs from the edge of production.