LOG_ENTRY: 2026.03.14 · 37 TELEMETRY_HITS

Building Production-Grade DevOps Infrastructure: A Complete Journey


Omobayonle Ogundele

MAIN_NODE: DEVOPS_ENGINEER

A detailed account of building a complete DevOps platform from scratch over three weeks, including every problem encountered and solution implemented.


Table of Contents

  1. The Beginning: Project Objectives
  2. Week 1: Building the Foundation
  3. Week 2: The CI/CD Pipeline From Hell
  4. Week 3: Going to Production
  5. The Complete Problem Log
  6. What I Actually Learned
  7. The Final Result

The Beginning: Project Objectives {#the-beginning}

Three weeks ago, I began a project to build a complete DevOps infrastructure platform from scratch. The goal was to move beyond theoretical knowledge from tutorials and documentation to hands-on implementation of production-grade systems.

Starting Resources:
- HP Stream laptop (4GB RAM) - initial testing environment
- HP EliteBook 840 G2 (8GB RAM) - main development machine
- Foundational knowledge of Docker and containerization
- Ubuntu 24.04 as the base operating system

Project Objectives:

The goal was to build enterprise-grade infrastructure components:

  • Self-hosted Git server for version control
  • Automated CI/CD pipeline for continuous integration and deployment
  • Private container registry for image management
  • Centralized secrets management
  • Comprehensive monitoring and logging infrastructure
  • Production deployment to cloud infrastructure
  • Complete automation from code commit to production

This project aimed to demonstrate the ability to design, implement, and troubleshoot complex infrastructure systems.


Day 0: The Hardware Wake-Up Call

I started by spinning up Prometheus, Grafana, and Gitea on my HP Stream laptop.

The laptop immediately became a bottleneck.

Containers took 2-3 minutes to start. The system was constantly swapping to disk. Running multiple services simultaneously brought the machine to a crawl. I'd run docker-compose up and then wait several minutes for all the services to initialize.

This was my first lesson: hardware matters.

Fortunately, I had an HP EliteBook 840 G2 with 8GB RAM that I use for my main work. I decided to use it for the homelab project as well. The difference was immediate and dramatic.

Containers started in seconds. I could run 10+ services simultaneously without performance issues. The development workflow became significantly smoother.

Lesson: Adequate hardware is essential for running containerized infrastructure. 8GB RAM should be considered the minimum for multi-container development environments.


Week 1: Building the Foundation {#week-1-foundation}

Day 1: The Monitoring Stack

I started with what every good DevOps engineer starts with: observability.

You can't fix what you can't see. So before building anything else, I needed to be able to monitor it.

The plan:
- Prometheus for metrics collection
- Grafana for visualization
- Node Exporter for system metrics

I created my project structure:

homelab/
├── monitoring/
│   ├── prometheus/
│   │   └── prometheus.yml
│   ├── grafana/
│   │   └── provisioning/
│   └── docker-compose.yml

The Prometheus config was straightforward:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
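The docker-compose.yml from the tree above can stay almost as small. A minimal sketch (image tags and port mappings here are assumptions, not my exact file):

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      # Mount the scrape config shown above into the container
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
```

Because all three services share the default compose network, Prometheus can reach node-exporter by its service name, which is exactly what the scrape config relies on.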

Ran docker-compose up -d.

It worked. First try.

I should have known it wouldn't last.


Day 2: The Git Server (First Real Problem)

Gitea is a self-hosted Git service. Think GitHub, but running on your own infrastructure.

I spun it up with PostgreSQL as the database:

services:
  gitea-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_PASSWORD: gitea123

  gitea:
    image: gitea/gitea:latest
    environment:
      GITEA__database__PASSWD: gitea123

Started it. Gitea loaded. Perfect.

Then I made a change to the configuration and restarted everything.

Error:

pq: password authentication failed for user "gitea"

Wait, what? I didn't change the password.

I spent 2 hours trying different things:
- Checked the compose file ✓ (passwords matched)
- Restarted containers ✗ (same error)
- Checked logs ✗ (database running fine)
- Googled the error ✗ (generic solutions didn't help)

Finally, it hit me. Docker volumes persist data.

The first time I ran Gitea, Docker created a volume with the database. When I changed the password in the compose file and restarted, the new password was in the environment variables, but the old password was still in the PostgreSQL database stored in the volume.

The fix:

docker-compose down -v  # The -v removes volumes
docker-compose up -d

It worked immediately.

Lesson learned: Docker volumes are persistent storage. If you need a fresh start, you need to explicitly remove them.

Time wasted: 2 hours
Value of lesson: Saved me countless hours in the future


Day 3: The Nested Repository Nightmare

I created a repository in Gitea for my sample application. Cloned it to my laptop. Started adding code.

Then I tried to commit everything to my main homelab repository:

git add .
git commit -m "Add sample app"

Error:

warning: adding embedded git repository: homelab/apps/sample-app
hint: You've added another git repository inside your git repository.

Oh no.

I'd cloned the sample-app repo instead of just copying the files. So now I had a Git repo inside a Git repo. Git doesn't like that.

This one was quick to fix:

cd homelab/apps/sample-app
rm -rf .git
cd ~/homelab
git add .
git commit -m "Add sample app files"

Lesson learned: Don't clone repos into other repos. Copy files or use Git submodules.

Time wasted: 30 minutes
Frustration level: Moderate


Days 4-7: Monitoring Everything

The rest of Week 1 was smoother. I:

  • Set up Grafana dashboards
  • Imported the Node Exporter dashboard (ID: 1860)
  • Configured Prometheus to scrape all my services
  • Added Loki for centralized logging
  • Set up Promtail to ship logs
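For reference, the Promtail side of that logging setup needs very little configuration. A minimal sketch (the file paths and Loki URL are assumptions, not my exact config):

```yaml
# promtail-config.yml (illustrative)
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml  # where Promtail remembers read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets: [localhost]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
```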

By the end of Week 1, I had complete observability. I could see CPU usage, memory, disk I/O, network traffic, and logs from all containers.

It felt good. Everything was working.

I thought I was ready for CI/CD.

I was so, so wrong.


Week 2: The CI/CD Pipeline From Hell {#week-2-cicd}

Week 2 nearly broke me. This is where I learned that DevOps is 90% debugging and 10% configuration.

Day 8: Setting Up Drone CI

Drone is a container-native CI/CD platform. Every build runs in a fresh Docker container. It integrates with Gitea via OAuth.

First, I needed to create an OAuth application in Gitea:
- Settings → Applications → Create OAuth2 Application
- Name: Drone CI
- Redirect URI: http://localhost:8080/login

Gitea gave me a Client ID and Client Secret. I generated an RPC secret:

openssl rand -hex 16

Then I configured Drone:

services:
  drone-server:
    image: drone/drone:2
    environment:
      DRONE_GITEA_SERVER: http://localhost:3001
      DRONE_GITEA_CLIENT_ID: [client-id]
      DRONE_GITEA_CLIENT_SECRET: [client-secret]
      DRONE_RPC_SECRET: [rpc-secret]

Started it. Opened http://localhost:8080.

OAuth redirect worked. I authorized with Gitea. Logged in.

Success!

Or so I thought.


Day 9: The localhost That Wasn't localhost

I activated my repository in Drone. Created a simple .drone.yml:

kind: pipeline
type: docker
name: default

steps:
  - name: test
    image: node:18-alpine
    commands:
      - npm install
      - npm test

Committed it. Pushed to Gitea.

The build triggered! Progress!

Then it failed:

fatal: unable to access 'http://localhost:3001/bayo/sample-app.git/': 
Failed to connect to localhost port 3001: Connection refused

But... Gitea was running on localhost:3001. I could access it in my browser. I could clone the repo from my terminal. What was going on?

I spent the next 3 hours in a debugging spiral.

Attempt 1: Maybe it needs host.docker.internal instead of localhost?

DRONE_GITEA_SERVER: http://host.docker.internal:3001

Result:

ERR_NAME_NOT_RESOLVED: host.docker.internal's server IP could not be found

Stack Overflow said to use host.docker.internal. Every tutorial used it. Why wasn't it working?

Because I'm on Linux. host.docker.internal is a Docker Desktop (Mac/Windows) feature. It doesn't exist on Linux.

Attempt 2: Try 127.0.0.1 instead?

DRONE_GITEA_SERVER: http://127.0.0.1:3001

Result:

Failed to connect to 127.0.0.1 port 3001: Connection refused

Same error. What the hell is happening?

Attempt 3: Disable the firewall entirely?

sudo ufw disable

Result: Still connection refused.

At this point I was frustrated. I'd been at this for 2.5 hours. I decided to do something I should have done from the beginning:

I actually read the Docker networking documentation.

Buried in the docs, I found this sentence:

"From the perspective of a container, localhost refers to the container itself, not the host machine."

Wait.

WAIT.

When the Drone container tries to connect to localhost:3001, it's looking inside itself for a Git server. Of course it can't find it. Gitea is running on my host machine, not inside the Drone container!

The solution: use the Docker bridge IP.

ip addr show docker0
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0

There it was. 172.17.0.1 - the IP address of the host machine as seen from inside containers.

DRONE_GITEA_SERVER: http://172.17.0.1:3001

Restarted Drone. Tried again.

It worked.

After 3 hours, the fix was changing one IP address.

Lesson learned: Containers are isolated. localhost inside a container means the container itself. Use the bridge IP (172.17.0.1) to reach the host from containers.

Time wasted: 3 hours
Knowledge gained: Fundamental understanding of Docker networking
Would I trade it: Absolutely not. This lesson was invaluable.
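Worth noting for anyone hitting this today: on reasonably recent Docker Engine releases (20.10 and later), Linux can opt into the host.docker.internal name too, via the special host-gateway value. A sketch of the compose-file form:

```yaml
services:
  drone-server:
    extra_hosts:
      # Maps the Docker Desktop-style name to the host's bridge IP on Linux
      - "host.docker.internal:host-gateway"
    environment:
      DRONE_GITEA_SERVER: http://host.docker.internal:3001
```

The hard-coded 172.17.0.1 works too, as long as you stay on the default bridge network; the extra_hosts approach just survives a non-default bridge subnet.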


Day 10: YAML, My Greatest Enemy

With Drone connecting to Gitea, I started writing my actual build pipeline.

I wanted:
1. Clone the repository
2. Run tests
3. Build a Docker image
4. Push to Harbor (my private registry)

Here's what I wrote:

kind: pipeline
type: docker
name: default

steps:
  - name: test
    image: node:18-alpine
    commands:
      - npm install
      - npm test

  - name: build
    image: plugins/docker
    settings:
      registry: 172.16.18.128:8888
      repo: 172.16.18.128:8888/library/sample-app
      tags: latest
      username: admin
      password:
        from_secret: harbor_password

Committed. Pushed. Watched the build start.

Error:

yaml: unmarshal errors:
  line 31: cannot unmarshal !!map into string

Line 31. Let me check line 31. That's... tags: latest.

What could possibly be wrong with tags: latest? It's just a string!

I added quotes: tags: "latest"

Same error.

I checked my indentation. Everything was aligned with 2 spaces. Consistent throughout.

I ran yamllint:

yamllint .drone.yml

It complained about line length but said nothing about structure.

I compared with examples in the Drone docs. My file looked identical.

One hour passed.

I was starting to lose my mind. This was a simple YAML file. What was I missing?

I decided to rewrite the entire file from scratch. Opened a new file. Started typing from memory.

When I got to the Docker build step, I wrote:

settings:
  tags:
    - latest

Wait. Why did I write it like that?

I looked at my broken file:

settings:
  tags: latest

And then I looked at the Drone docs more carefully:

settings:
  tags:
    - latest
    - ${DRONE_COMMIT_SHA}

Oh.

Oh no.

The Docker plugin expects tags to be a list, not a string.

In YAML:
- tags: latest means "tags is a string with value 'latest'"
- tags:\n - latest means "tags is a list containing one string 'latest'"

The plugin was trying to unmarshal a string where it expected a list. That's what the error meant.

The fix:

tags:
  - latest
  - build-${DRONE_BUILD_NUMBER}

Saved. Committed. Pushed. Held my breath.

The build succeeded.

Three. Hours. One. Space.

Actually, one dash. But you get the point.

Lesson learned: YAML is whitespace-sensitive and data-type-sensitive. A string is not a list. A list is not a map. There's no "close enough" in YAML. It either matches the expected structure exactly, or it fails.

Time wasted: 3+ hours
New fear unlocked: YAML
Fun fact: I now validate all YAML with yamllint and python -c "import yaml; yaml.safe_load(open('file.yml'))" before deploying
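That validation habit pays off because the two spellings really are different data types. A quick check of how PyYAML parses them (assuming PyYAML is installed):

```python
# The two YAML spellings of "tags" parse to different Python types:
# a plain scalar becomes a str, a dash-prefixed block becomes a list.
import yaml

as_string = yaml.safe_load("tags: latest")
as_list = yaml.safe_load("tags:\n  - latest")

print(type(as_string["tags"]).__name__)  # str
print(type(as_list["tags"]).__name__)    # list
```

A plugin that declares its tags setting as a list will reject the first form, which is exactly the unmarshal error Drone threw.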


Day 11-12: Harbor Registry

With Drone working, I needed somewhere to push Docker images. Enter Harbor - an open-source container registry with vulnerability scanning.

I downloaded the installer:

wget https://github.com/goharbor/harbor/releases/download/v2.10.0/harbor-online-installer-v2.10.0.tgz
tar xzvf harbor-online-installer-v2.10.0.tgz
cd harbor

Configured harbor.yml:
- hostname: 172.16.18.128
- http port: 8888
- admin password: Harbor12345

Ran the installer:

sudo ./install.sh

Ten minutes later, Harbor was running. 10 containers, all humming along. Trivy scanner included for vulnerability detection.

I created a project called library and configured Drone to push there.

Updated my pipeline:

- name: build
  image: plugins/docker
  settings:
    registry: 172.16.18.128:8888
    repo: 172.16.18.128:8888/library/sample-app
    tags:
      - latest
      - build-${DRONE_BUILD_NUMBER}
    username: admin
    password:
      from_secret: harbor_password
    insecure: true  # Self-signed cert

Pushed code. Watched the build. Held my breath.

The image built. The push succeeded. I checked Harbor.

There it was. My first automated Docker build, sitting in my private registry.

This felt amazing.


Day 13-14: The Webhook Mystery

Everything was working... when I manually triggered builds in Drone.

But when I pushed code to Gitea, nothing happened.

The webhook existed in Gitea settings. Test delivery showed "200 OK". But real pushes triggered nothing.

I checked Drone logs. No webhook received.

I checked the webhook URL:

http://localhost:8080/hook

Localhost again. Of course.

Changed it to:

http://172.17.0.1:8080/hook

Pushed code. Build triggered!

But wait. It still failed. Different error now:

Failed to clone repository. Branch 'main' not found.

My code was on the master branch. My pipeline was watching main.

Modern Git uses main as the default. Older repos use master. My Gitea installation defaulted to master.

Updated the pipeline:

when:
  branch:
    - main
    - master

Pushed again.

Success! The build triggered automatically and completed.

Lesson learned: Webhooks need exact URLs (no localhost), and branch naming matters. Test thoroughly.

Time wasted: 2 hours
Patience remaining: Running low


Week 3: Going to Production {#week-3-production}

Day 15-16: Setting Up Oracle Cloud

I had everything working on my homelab. Now I needed to deploy to an actual cloud server.

I chose Oracle Cloud's free tier:
- VM instance (AMD shape)
- 1 OCPU, 6GB RAM
- Ubuntu 24.04
- Free forever

I used Ansible to set it up:

---
- name: Setup Docker on Oracle Cloud
  hosts: oracle
  become: yes
  tasks:
    - name: Install Docker
      apt:
        name:
          - docker.io
          - docker-compose
        state: present
        update_cache: yes

    - name: Configure insecure registries
      copy:
        content: |
          {
            "insecure-registries": ["172.16.18.128:8888"]
          }
        dest: /etc/docker/daemon.json
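The playbook's oracle group comes from an Ansible inventory file. A minimal sketch (the IP and key path are placeholders, not my real values):

```ini
# inventory.ini (illustrative)
[oracle]
oracle-vm ansible_host=203.0.113.10 ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/oracle_key
```

Then the playbook runs with: ansible-playbook -i inventory.ini setup-docker.yml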

Then I set up a WireGuard VPN tunnel between my homelab and Oracle Cloud:

Homelab (10.0.0.1):

[Interface]
PrivateKey = [homelab-private-key]
Address = 10.0.0.1/24
ListenPort = 51820

[Peer]
PublicKey = [oracle-public-key]
AllowedIPs = 10.0.0.2/32

Oracle Cloud (10.0.0.2):

[Interface]
PrivateKey = [oracle-private-key]
Address = 10.0.0.2/24

[Peer]
PublicKey = [homelab-public-key]
AllowedIPs = 10.0.0.1/32
Endpoint = [my-home-ip]:51820

Started WireGuard on both sides:

sudo wg-quick up wg0

Tested:

ping 10.0.0.2  # From homelab
ping 10.0.0.1  # From cloud

Both worked! Secure tunnel established.


Day 17: The Clone Problem (Again)

I added a deploy step to my Drone pipeline:

- name: deploy
  image: appleboy/drone-ssh
  settings:
    host: 10.0.0.2
    username: ubuntu
    key:
      from_secret: oracle_ssh_key
    script:
      - docker pull 10.0.0.1:8888/library/sample-app:latest
      - docker stop sample-app || true
      - docker rm sample-app || true
      - docker run -d --name sample-app -p 5000:3000 10.0.0.1:8888/library/sample-app:latest

But the default clone step was still using localhost, and I'd already learned that localhost inside a container doesn't reach the host.

Solution: disable the default clone and write my own:

clone:
  disable: true

steps:
  - name: clone
    image: alpine/git
    commands:
      - git clone http://172.17.0.1:3001/bayo/sample-app.git .
      - git checkout $DRONE_COMMIT

Pushed code. The pipeline ran. The deploy step executed.

I checked the Oracle Cloud server:

docker ps

There it was. My application. Running. In the cloud. Deployed automatically.

I opened my browser: http://[oracle-ip]:5000

It loaded.

My first automated deployment to the cloud.


Day 18: Building the Portfolio Website

With the pipeline working, I built my actual portfolio website:

  • Flask backend
  • SQLAlchemy ORM
  • PostgreSQL database
  • Jinja2 templates
  • TailwindCSS for styling

Created a proper Dockerfile:

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 5000

CMD ["python", "run.py"]

Note that CMD line: getting it right cost me an hour of debugging later.


Day 19: The Container Crash Loop

Deployed the portfolio website. The pipeline succeeded. Image pushed to Harbor. Deployment to cloud executed.

I checked the logs:

docker logs portfolio
Failed to find attribute 'app' in 'app'.
Worker (pid:7) exited with code 4
App failed to load.

The container kept crashing and restarting.

I'd initially written:

CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]

Gunicorn was looking for:
- A file named app.py
- With a variable named app

My project structure:

portfolio/
├── app/
│   ├── __init__.py
│   └── routes/
├── run.py
└── Dockerfile

No app.py file. The entry point was run.py.

Changed the Dockerfile:

CMD ["python", "run.py"]

Pushed. Deployed. Checked logs.

It worked.

Lesson learned: Container entry points must match actual code structure. Check what files you actually have!

Time wasted: 1 hour
Obvious in hindsight: Extremely
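For the record, switching back to Gunicorn later wouldn't require an app.py; it just needs the module:variable pair to match reality. Assuming run.py exposes a module-level app (for example app = create_app()), this Dockerfile line would work:

```dockerfile
# Hypothetical alternative: point Gunicorn at run.py's module-level `app`
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "run:app"]
```

The left side of the colon is the importable module name, the right side is the attribute Gunicorn looks up, which is why app:app failed against a package with no module-level app.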


Day 20: The Firewall Surprise

Container running. Logs clean. Everything perfect.

Opened browser: http://[oracle-ip]

This site can't be reached
Connection refused

But the container was running! I could curl it from inside the VM!

# From Oracle Cloud VM
curl http://localhost:80
# Works!

# From my laptop
curl http://[oracle-ip]
# Connection refused

Firewall.

Oracle Cloud blocks ALL traffic by default. I needed to open port 80 in two places:

1. VM iptables:

sudo iptables -I INPUT 6 -m state --state NEW -p tcp --dport 80 -j ACCEPT
sudo netfilter-persistent save

2. Oracle Cloud Security List:
- Compute → Instances → Subnet → Security List
- Add Ingress Rule:
- Source: 0.0.0.0/0
- Protocol: TCP
- Port: 80

Saved the rule. Waited 30 seconds.

Tried again from my laptop:

curl http://[oracle-ip]

It worked!

Opened browser. My portfolio website loaded. From the internet. Deployed automatically via my CI/CD pipeline.

This was the moment everything clicked.


Day 21: Welcome to the Internet

I was proud. My website was live. I'd built the entire infrastructure myself.

I decided to check the logs to make sure everything was running smoothly:

docker logs portfolio --tail 50

What I saw made my stomach drop:

89.248.168.239 - - [04/Mar/2026 21:20:35] "POST /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
89.248.168.239 - - [04/Mar/2026 21:20:36] "POST /public/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
89.248.168.239 - - [04/Mar/2026 21:20:37] "POST /web/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
89.248.168.239 - - [04/Mar/2026 21:20:38] "POST /site/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -
[hundreds more lines...]

My website had been live for 5 minutes and was already under attack.

Automated scanners were probing for:
- PHPUnit vulnerabilities
- WordPress exploits
- SQL injection points
- Path traversal vulnerabilities
- Directory listing
- Old framework versions

All of them returned 404. My Flask app doesn't have any of those paths. I wasn't vulnerable.

But that was luck, not planning.

I'd deployed to production without:
- Firewall rules (beyond port opening)
- Rate limiting
- Fail2Ban
- HTTPS
- Security headers
- Any real hardening

The attacks taught me something crucial: the internet is hostile by default.

Every public IP address is being scanned constantly. Automated bots probe for any weakness. The question isn't if you'll be attacked. It's when, and whether you'll be vulnerable when it happens.
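A quick way to eyeball these probes is a small log filter. A sketch, assuming the werkzeug-style access-log format shown above:

```python
import re
from collections import Counter

# Matches lines like:
# 89.248.168.239 - - [04/Mar/2026 21:20:35] "POST /vendor/... HTTP/1.1" 404 -
LOG_RE = re.compile(r'^(\S+) .*?"(\w+) (\S+) HTTP/[\d.]+" (\d{3})')

def count_404_probes(lines):
    """Count 404 responses per client IP, a rough scanner fingerprint."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(4) == "404":
            hits[m.group(1)] += 1
    return hits

sample = [
    '89.248.168.239 - - [04/Mar/2026 21:20:35] "POST /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -',
    '89.248.168.239 - - [04/Mar/2026 21:20:36] "POST /public/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 -',
    '203.0.113.7 - - [04/Mar/2026 21:21:00] "GET / HTTP/1.1" 200 -',
]
print(count_404_probes(sample))
```

Pipe docker logs portfolio through a script like this and any IP hammering you with hundreds of 404s stands out immediately.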

I immediately:

1. Enabled UFW:

sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

2. Installed Fail2Ban:

sudo apt install fail2ban
sudo systemctl enable fail2ban
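On most distros Fail2Ban only watches sshd out of the box; a small jail.local makes the policy explicit. A sketch (the thresholds are illustrative, not a recommendation):

```ini
# /etc/fail2ban/jail.local (illustrative values)
[DEFAULT]
bantime  = 1h
findtime = 10m
maxretry = 5

[sshd]
enabled = true
```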

3. Added monitoring alerts:

- alert: HighErrorRate
  expr: rate(http_requests_total{status="404"}[5m]) > 100
  annotations:
    summary: "High rate of 404 errors - possible attack"

4. Planned for HTTPS:
- Get domain name
- Configure Let's Encrypt
- Add SSL certificate
- Enable HSTS

Lesson learned: Security is not optional. It's not something you add later. The moment your server is public, the attacks start. Be ready from day one.

Time to first attack: 5 minutes
Number of attack attempts in first hour: 500+
Vulnerability exploitation success rate: 0% (but only by luck)


The Complete Problem Log {#complete-problems}

Here's every problem I faced, in chronological order:

Problem 1: Hardware Constraints

  • Issue: 4GB RAM insufficient for multiple containers
  • Symptom: Extreme slowness, constant swapping, 2-3 minute container startup times
  • Solution: Switched to HP EliteBook 840 G2 with 8GB RAM (main work laptop)
  • Time: Initial testing showed the issue, migration completed in under an hour
  • Lesson: Adequate hardware is essential for containerized development; 8GB RAM minimum recommended

Problem 2: Database Password Mismatch

  • Error: pq: password authentication failed
  • Cause: Docker volumes persisted old password
  • Solution: docker-compose down -v
  • Time: 2 hours
  • Lesson: Volumes persist data across restarts

Problem 3: Nested Git Repositories

  • Error: adding embedded git repository
  • Cause: Cloned repo had its own .git folder
  • Solution: rm -rf .git in nested directory
  • Time: 30 minutes
  • Lesson: Don't nest git repos

Problem 4: Docker Networking - localhost

  • Error: Connection refused to localhost:3001
  • Cause: localhost inside container ≠ host machine
  • Solution: Used Docker bridge IP (172.17.0.1)
  • Time: 3 hours
  • Lesson: Containers are isolated

Problem 5: YAML Unmarshal Errors

  • Error: cannot unmarshal !!map into string
  • Cause: tags: latest should be list format
  • Solution: tags:\n - latest
  • Time: 3+ hours
  • Lesson: YAML data types are strict

Problem 6: Webhook Not Triggering

  • Issue: Builds didn't start on git push
  • Causes: localhost URL, branch name mismatch
  • Solution: Bridge IP, watch both main and master
  • Time: 2 hours
  • Lesson: Test webhooks thoroughly

Problem 7: Drone Runner Connection

  • Error: http: no content returned
  • Cause: Runner using localhost instead of service name
  • Solution: DRONE_RPC_HOST: drone-server
  • Time: 1 hour
  • Lesson: Use service names for container communication

Problem 8: Git Clone in Pipeline

  • Error: Connection refused during clone
  • Cause: Default clone using localhost
  • Solution: Custom clone step with bridge IP
  • Time: 1.5 hours
  • Lesson: Override defaults when needed

Problem 9: CI Authentication

  • Error: terminal prompts disabled
  • Cause: CI can't do interactive auth
  • Solution: Made repo public / used tokens
  • Time: 1 hour
  • Lesson: CI/CD needs non-interactive auth

Problem 10: Container Crash Loop

  • Error: Failed to find attribute 'app'
  • Cause: CMD didn't match code structure
  • Solution: Changed to python run.py
  • Time: 1 hour
  • Lesson: Entry point must match actual files

Problem 11: Cloud Firewall

  • Issue: Can't access from internet
  • Cause: Oracle Cloud blocks all traffic by default
  • Solution: Opened port 80 in security list + iptables
  • Time: 1.5 hours
  • Lesson: Cloud firewalls are locked down

Bonus: Immediate Attacks

  • Discovery: Attacks started within 5 minutes
  • Types: PHPUnit, WordPress, SQL injection, path traversal
  • Result: All blocked (404)
  • Action: Added firewall, Fail2Ban, monitoring
  • Lesson: Internet is hostile, security is mandatory

Total Debugging Time: ~18 hours (roughly a third of the project)
Problems Solved: 11/11 (100% success rate)
Most Valuable Lessons: Docker networking, YAML syntax, security awareness


What I Actually Learned {#lessons-learned}

Technical Skills

Docker & Containers:
- Container networking (bridge IPs, service names)
- Volume management and persistence
- Multi-container orchestration
- Image building and optimization
- Registry management

CI/CD:
- Pipeline design and debugging
- Webhook configuration
- Secret management
- Automated deployments
- Zero-downtime updates

Infrastructure:
- Linux server administration
- Firewall configuration
- VPN tunnels (WireGuard)
- Cloud platform basics
- Monitoring and logging

Configuration Management:
- YAML syntax (the hard way)
- Docker Compose
- Environment variables
- Configuration as code

Security:
- Firewall rules (iptables, UFW)
- Fail2Ban setup
- Attack pattern recognition
- Security-first thinking

Soft Skills

Problem Solving:
- Systematic debugging approach
- Reading error messages carefully
- Testing one variable at a time
- Knowing when to start over

Persistence:
- 3-hour debug sessions without giving up
- Trying multiple approaches
- Learning from each failure
- Celebrating small wins

Documentation:
- Writing down solutions immediately
- Creating reference guides
- Explaining complex topics simply
- Sharing knowledge with others

Learning:
- Reading official documentation first
- Understanding fundamentals before shortcuts
- Building mental models
- Connecting concepts across tools

Philosophy Changes

Before: "I need to watch more tutorials before I start."

After: "The only way to learn is by building and breaking things."


Before: "This error is impossible to debug."

After: "This error has a logical cause. I just need to find it."


Before: "DevOps is about knowing all the tools."

After: "DevOps is about understanding how systems work and debugging when they don't."


Before: "I'll add security later when I have time."

After: "Security is built-in from day one, not bolted on later."


The Final Result {#final-result}

What I Built

Infrastructure Components:
- Prometheus (metrics collection)
- Grafana (visualization and dashboards)
- Node Exporter (system metrics)
- Gitea (self-hosted Git server)
- PostgreSQL (database for Gitea)
- Drone CI Server (CI/CD orchestration)
- Drone Runner (build execution)
- Harbor (container registry with 10 internal containers)
- Vault (secrets management)
- Loki (log aggregation)
- Promtail (log shipping)
- WireGuard (VPN tunnel)
- Portfolio Website (Flask application)

Total Containers: 18+
Total Services: 13
Lines of YAML: 500+
Configuration Files: 20+

Workflow

The complete automated workflow:

  1. Developer (me): git commit -m "New feature" && git push origin main
  2. Gitea webhook triggers Drone CI
  3. Drone Runner starts the build:
     - Clone code (using the bridge IP)
     - Run tests (npm test)
     - Build the Docker image
     - Push to the Harbor registry
  4. Drone SSHes to Oracle Cloud via the WireGuard VPN:
     - Pull the image from Harbor (via 10.0.0.1:8888)
     - Stop and remove the old container
     - Start the new container
     - Verify the health check
  5. Logs ship to Loki, metrics flow to Prometheus, alerts fire from Grafana
  6. Website live at http://[oracle-ip]
    
Total time: 3 minutes
Manual steps: 0

Statistics

Time Investment:
- Week 1 (Monitoring): 8 hours
- Week 2 (CI/CD): 12 hours
- Week 3 (Cloud): 10 hours
- Debugging: ~18 hours
- Documentation: 3 hours
- Total: ~51 hours over 3 weeks

Problems Encountered: 11 major issues
Solutions Found: 11 (100% success rate)
Tutorials Watched: 5
Documentation Pages Read: 50+
Stack Overflow Visits: Too many to count
"It works!" Moments: Priceless

Current Status:
- Infrastructure: Running smoothly
- Deployments: Fully automated
- Monitoring: Complete visibility
- Security: Hardened and monitored
- Uptime: 2+ weeks
- Attack attempts blocked: 500+

Access & Links

Infrastructure Services:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Gitea: http://localhost:3001
- Drone CI: http://localhost:8080
- Harbor: http://localhost:8888

Production:
- Portfolio Website: http://[oracle-ip]
- Status: Live and deployed
- Deploy time: 3 minutes from git push
- Automation: 100%


What's Next

Immediate Improvements (This Week)

  1. SSL Certificate
     - Get a domain name
     - Configure Let's Encrypt
     - Enable HTTPS everywhere
  2. Better Monitoring
     - Application-level metrics
     - Custom Grafana dashboards
     - Alert routing to Slack/Email
  3. Security Hardening
     - Rate limiting in Nginx
     - Security headers (HSTS, CSP)
     - Regular security audits

Short-Term Projects (Next Month)

  1. Transport Wallet API
     - Python FastAPI backend
     - PostgreSQL database
     - Redis caching
     - Deploy using the existing pipeline
  2. AWS Multi-Tier Application
     - Learn Terraform
     - Deploy to AWS
     - Compare self-hosted vs cloud
  3. Kubernetes Migration
     - Set up a K3s cluster
     - Deploy apps to K8s
     - Learn container orchestration

Long-Term Goals (Next 3 Months)

  1. Advanced Monitoring
     - ELK stack
     - Distributed tracing
     - APM tools
  2. GitOps Implementation
     - ArgoCD
     - Declarative configuration
     - Git as the source of truth
  3. Multi-Cloud Strategy
     - Deploy to AWS, GCP, Azure
     - Compare platforms
     - Learn cloud-agnostic design

Advice for Anyone Starting This Journey

Do This:

1. Start with a real project

Don't just follow tutorials. Build something you'll actually use. A portfolio site, a blog, a tool you need. Real projects have real problems, and solving real problems is how you learn.

2. Document everything

Write down every problem you face and how you solve it. Your future self will thank you. So will others who hit the same issues.

3. Embrace the debugging

You'll spend 30-40% of your time debugging. This is normal. This is where the learning happens. The tutorials skip this part, but this is the most valuable part.

4. Read the official docs

Stack Overflow is helpful, but the official documentation is where you build real understanding. Read it. Even when it's boring.

5. Build observability first

Set up monitoring and logging before you build anything else. You can't debug what you can't see.

6. Version control everything

Your infrastructure configs, your deployment scripts, your documentation. Put it all in Git. You'll need it when things break at 2am.

Don't Do This:

1. Wait until you "feel ready"

You'll never feel ready. Start building. You'll learn along the way.

2. Skip the fundamentals

Don't jump straight to Kubernetes without understanding Docker. Don't use Terraform without understanding what it's automating. Learn the basics first.

3. Copy-paste without understanding

That Stack Overflow answer might work, but if you don't understand why, you'll be stuck when it breaks in a different way.

4. Ignore security

Don't deploy to production without basic security. Firewall, fail2ban, HTTPS. The attacks start immediately. Be ready.

5. Give up on long debugging sessions

That 3-hour YAML debugging session? That's where I learned YAML deeply. Persistence pays off.

6. Deploy without monitoring

If you can't see what's happening, you can't fix it when it breaks. And it will break.


The Truth About DevOps

DevOps tutorials make it look clean:
- Write a config file
- Run a command
- Everything works
- Celebrate

Reality is messier:
- Write a config file
- Run a command
- Error
- Debug for 2 hours
- Google the error
- Try 5 different solutions
- None work
- Read the documentation
- Find the actual problem
- Fix it
- Run the command
- Different error
- Repeat

The gap between tutorials and reality is where the learning happens.

Nobody talks about:
- The 3-hour YAML debugging sessions
- The networking concepts you don't understand
- The errors that make no sense
- The moments you want to give up
- The small victories that keep you going

But this is the job. DevOps is 10% configuration and 90% debugging. If you can systematically debug production issues, you're valuable. If you can build working infrastructure from scratch, you're hireable.


Final Thoughts

Three weeks ago, I had a 4GB laptop and zero DevOps experience.

Today, I have:
- Production infrastructure running in the cloud
- Automated deployments (git push → live in 3 minutes)
- Complete monitoring and logging
- 11 problems solved and documented
- Deep understanding of Docker, CI/CD, and networking
- Proof that I can build and debug real systems

The infrastructure is running. The automation works. The security is hardened. The monitoring is comprehensive.

This isn't a tutorial project. This is production infrastructure that I built, debugged, and deployed myself.

And if I can do it in 3 weeks with a used laptop and an internet connection, anyone can.

The question isn't whether you're ready. The question is whether you're willing to start.


Resources & Links

My Infrastructure:
- GitHub: [link to your repo]
- Live Site: http://[your-site]
- Architecture Diagrams: [link]

Tools Used:
- Docker & Docker Compose
- Gitea
- Drone CI
- Harbor
- HashiCorp Vault
- Prometheus & Grafana
- Loki & Promtail
- WireGuard
- Oracle Cloud

Helpful Documentation:
- Docker Networking: https://docs.docker.com/network/
- Drone CI: https://docs.drone.io/
- Harbor: https://goharbor.io/docs/
- Prometheus: https://prometheus.io/docs/

My Other Posts:
- Docker Networking: Why localhost Isn't localhost
- The YAML Nightmare: 3 Hours Debugging One Space
- Attacked in 5 Minutes: Security Lessons


Comments & Discussion

Did you build something similar? Hit the same problems? Have questions about my setup?

Drop a comment below. I read and respond to every one.

If this helped you, share it with others who are learning DevOps. The more people who understand that the struggle is normal, the more people will push through and succeed.

Now go build something. Break something. Debug it. Learn from it.

That's how you become a DevOps engineer.


Written by Bayo
Building in public | Documenting everything | Learning DevOps the hard way

Follow my journey:
GitHub · LinkedIn · Twitter


Published: March 8, 2026
Last Updated: March 8, 2026
Reading Time: 25 minutes
Word Count: ~8,500 words


If you made it this far, you're serious about learning DevOps. Welcome to the journey.


Omobayonle Ogundele

DevOps Engineer based in Lagos, Nigeria. Building reliable infrastructure and sharing logs from the edge of production.
