Learn Cloud with Amina

2-month roadmap

The Full Curriculum

Eight focused weeks. Each one builds directly on the last. By the end you will have a running Kubernetes workload, an IaC repo, a CI/CD pipeline, and dashboards proving it all works.

WEEK 01 Linux, Networking and Git [bash, tcp/ip, git]

Why this week exists

Everything in cloud runs on Linux. Every deployment is triggered by a Git push. Every network call obeys TCP/IP. This week you build the mental models before they appear inside an abstraction layer.

Linux filesystem hierarchy: what lives in /etc, /var, /proc and why
Processes, signals, and the init system (systemd)
Networking primitives: IP addressing, CIDR, routing, DNS resolution, TCP handshake
SSH: key-based auth, agent forwarding, port tunnelling
Git internals: blobs, trees, commits, the reflog, detached HEAD
Branching strategy: trunk-based development vs GitFlow and when each fits
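
CIDR notation is easier to reason about once you have computed a few ranges yourself. The sketch below uses Python's standard ipaddress module; the 10.0.1.0/24 range is a hypothetical example, not a value the course prescribes.

```python
import ipaddress

# Hypothetical example range; any private CIDR block behaves the same way.
subnet = ipaddress.ip_network("10.0.1.0/24")


def describe(network):
    """Summarise a CIDR block: mask, size, and first/last address."""
    return {
        "netmask": str(network.netmask),
        "total_addresses": network.num_addresses,
        "network_address": str(network.network_address),
        "broadcast_address": str(network.broadcast_address),
    }


def contains(network, address):
    """Return True when the address falls inside the CIDR block."""
    return ipaddress.ip_address(address) in network


print(describe(subnet))
print(contains(subnet, "10.0.1.42"))  # inside the /24
print(contains(subnet, "10.0.2.1"))   # outside the /24
```

Changing the prefix length from /24 to /20 and rerunning is a quick way to see how each bit removed from the prefix doubles the address count.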

What you will build

Provision a raw Ubuntu VM on any free tier (GCP e2-micro, AWS t2.micro, or local Multipass). Do not use the console wizard for networking config.

Write a bash script that audits open ports and writes a report to /var/log/audit.txt on a schedule via cron
Diagnose a deliberately broken DNS config using only dig, ss, and traceroute
Create a Git repo, force a merge conflict, resolve it, squash the fix into one clean commit, and write a useful commit message
Set up SSH key auth and harden /etc/ssh/sshd_config to refuse password login
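
The lab asks for the port audit in bash. As a comparison point for the bash vs Python decision, here is a rough Python sketch of just the parsing step. The sample text imitates the output of ss -tln; the column layout is an assumption and may differ between versions, and the allowed-port set is hypothetical.

```python
from typing import Iterable, List, Set

# Illustrative sample only; real `ss -tln` output may be laid out differently.
SAMPLE_SS_OUTPUT = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN 0      4096   127.0.0.53%lo:53    0.0.0.0:*
LISTEN 0      128    0.0.0.0:22          0.0.0.0:*
LISTEN 0      511    [::]:80             [::]:*
LISTEN 0      128    [::]:22             [::]:*
"""


def listening_ports(ss_output: str) -> List[int]:
    """Extract the unique listening ports from ss-style text, sorted ascending."""
    ports: Set[int] = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a listening socket.
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # The port is whatever follows the last colon of the local address.
        ports.add(int(fields[3].rsplit(":", 1)[1]))
    return sorted(ports)


def unexpected_ports(found: Iterable[int], allowed: Set[int]) -> List[int]:
    """Return the ports that are open but not on the allow list."""
    return [port for port in found if port not in allowed]


found = listening_ports(SAMPLE_SS_OUTPUT)
print("open:", found)
print("unexpected:", unexpected_ports(found, allowed={22}))
```

The same logic in bash is a pipeline of ss, awk, and sort. Writing both versions is a fair way to form your own view on where each language starts to hurt.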

Week Deliverable

GitHub repo with your audit script, a network troubleshooting runbook in Markdown, and a post-mortem on the one thing that broke during the lab.

Decisions you will encounter in the real world

Bash vs Python for automation scripts: bash is everywhere but brittle at scale; Python is portable but adds a runtime dependency
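SSH keys vs certificates (for example, HashiCorp Vault's SSH secrets engine): static keys do not expire unless you rotate them; certificates carry an expiry and identity metadata, so access lapses on its own. Session brokers such as AWS SSM go further and remove the need to distribute SSH credentials at all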
Merge commits vs squash vs rebase: merge preserves history, squash keeps main clean, rebase rewrites history and can break shared branches
Monorepo vs polyrepo: monorepo simplifies atomic changes across services but CI scales poorly without tooling like Turborepo or Nx

WEEK 02 Cloud Fundamentals and IAM Security [gcp, iam, least-privilege]

The cloud mental model

Cloud is not a magic computer farm. It is an API over hardware with a billing model. Understanding the shared responsibility model tells you exactly where your obligations start.

IaaS vs PaaS vs SaaS and where your code lives on that spectrum
Regions, zones, and why you spread workloads across them
IAM: principals (users, service accounts, groups), resources, roles, and the policy evaluation logic
Principle of least privilege: why roles/owner on a service account is a security incident waiting to happen
Service account keys vs Workload Identity Federation: a key is a long-lived credential that lives on disk until someone rotates it; Workload Identity Federation exchanges an external identity token for short-lived credentials
Audit logging: what Cloud Audit Logs record and why you turn on Data Access logs from day one
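
To make the principal, role, and permission vocabulary concrete, here is a deliberately simplified model of an allow policy. It is a sketch, not how GCP evaluates policy: it ignores resource hierarchy inheritance, deny policies, conditions, and group membership. The service account name is hypothetical and the role-to-permission table is an illustrative subset.

```python
from typing import Dict, List, Set

# Illustrative subset only: real roles contain more permissions than shown.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "roles/storage.objectCreator": {"storage.objects.create"},
    "roles/storage.objectViewer": {"storage.objects.get", "storage.objects.list"},
}

# Hypothetical principal used only for this example.
UPLOADER = "serviceAccount:uploader@example-project.iam.gserviceaccount.com"

# An allow policy is a list of bindings: one role, many members.
BINDINGS: List[dict] = [
    {"role": "roles/storage.objectCreator", "members": [UPLOADER]},
]


def is_allowed(principal: str, permission: str, bindings: List[dict]) -> bool:
    """Return True if any binding grants the principal a role containing the permission."""
    for binding in bindings:
        granted = ROLE_PERMISSIONS.get(binding["role"], set())
        if principal in binding["members"] and permission in granted:
            return True
    # No matching binding means the request is denied by default.
    return False


print(is_allowed(UPLOADER, "storage.objects.create", BINDINGS))
print(is_allowed(UPLOADER, "storage.objects.get", BINDINGS))
print(is_allowed(UPLOADER, "resourcemanager.projects.getIamPolicy", BINDINGS))
```

Least privilege in this model means the bindings list stays as short as the workload allows. A single roles/owner binding would make almost every check return True.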

What you will build

Set up a GCP project from scratch using only the CLI. Zero console clicks for resource creation.

Create a project, enable billing alerts at $5 and $10, and enable only the APIs you need
Create a service account with the minimum roles to write to a GCS bucket and nothing else. Verify it cannot read IAM policies.
Enable Data Access audit logs and trigger a denied action. Find it in Cloud Logging within 2 minutes.
Simulate a credential leak: commit a fake key to a public, test-only GitHub repo, watch GitHub secret scanning flag it, then rotate immediately

Week Deliverable

A written IAM policy document for a hypothetical three-tier web application listing every principal, what role they hold, and why no broader role was appropriate.

Decisions you will encounter in the real world

Predefined roles vs custom roles: predefined are maintained by Google, custom give you precision but require you to track permission changes yourself
Per-project service accounts vs shared service accounts: shared simplifies management, per-project contains blast radius
Organization policy constraints vs IAM: org policies are guardrails you cannot override with IAM; use them for non-negotiable controls
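Billing alerts vs budget caps: alerts only notify. GCP has no built-in hard cap, so a cap means automation that disables billing when a budget threshold fires; that does stop spending, but it can take down production if the limit is set too low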

WEEK 03 Compute, Storage and Cloud Networking [vpc, gce, gcs]

Where your code actually runs

Before you containerise anything you need to understand the machine underneath. This week you learn how virtual machines, object storage, and virtual networks combine into a working application environment.

VPC architecture: subnets, routes, firewall rules, NAT, Private Google Access
Compute Engine: machine families, persistent disks, startup scripts, preemptible vs standard
Cloud Storage: buckets, object lifecycle policies, signed URLs, storage classes (Standard, Nearline, Coldline, Archive)
Load balancing concepts: L4 vs L7, health checks, backend services
Cloud DNS and how internal DNS resolution differs from public
VPC peering vs Shared VPC: when each model applies in an org context

What you will build

Deploy a two-tier application: a backend VM in a private subnet with no public IP, fronted by an HTTP load balancer. All network config written as gcloud commands you can repeat.

Create a custom-mode VPC with two subnets (public and private). Note that GCP subnets are regional resources, so zone placement is a property of each VM, not of the subnet
Deploy a backend VM in the private subnet, configure Cloud NAT for outbound internet
Serve a static site from GCS with a custom domain and HTTPS via a managed certificate
Set up a lifecycle rule to move objects older than 30 days to Nearline and delete after 90
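
The lifecycle rule in the last step is a small JSON document. The sketch below generates one in Python. The field names follow the Cloud Storage lifecycle configuration format as I understand it, so check them against the current reference before applying; the function name and its parameters are my own.

```python
import json


def lifecycle_config(nearline_after_days: int, delete_after_days: int) -> dict:
    """Build a lifecycle config: move objects to Nearline, then delete them later."""
    if delete_after_days <= nearline_after_days:
        raise ValueError("deletion must happen after the move to Nearline")
    return {
        "rule": [
            {
                # Objects older than this many days change storage class.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                # Objects older than this many days are removed.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


config = lifecycle_config(nearline_after_days=30, delete_after_days=90)
print(json.dumps(config, indent=2))
```

Save the printed JSON to a file and apply it to the bucket from the CLI, so the rule lives in your repo rather than in a console form.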

Week Deliverable

A network architecture diagram (draw.io or Excalidraw) showing every subnet, firewall rule, and traffic path. Annotate each decision with one sentence explaining why.

Decisions you will encounter in the real world

VM vs Cloud Run vs GKE for a stateless service: VMs give control, Cloud Run eliminates operations, GKE gives you the full platform surface
Custom-mode vs auto-mode VPC: auto-mode is fast to start with, but every auto-mode network uses the same predefined CIDR ranges, so two of them overlap and cannot be peered; custom-mode lets you choose the ranges
External LB vs internal LB: external faces the internet and typically terminates TLS at the edge; internal is often cheaper but only reachable inside the VPC
Object storage vs block storage vs file storage: each has a different access pattern and cost model; choosing wrong costs real money
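
The CIDR point in the VPC mode decision can be checked mechanically before you try to peer two networks. A minimal sketch with the standard ipaddress module follows. The 10.128.0.0/9 block is the range auto-mode subnets are allocated from; every other range here is a hypothetical example.

```python
import ipaddress
from typing import List, Tuple

# Auto-mode VPC networks allocate their subnets from this block.
AUTO_MODE_RANGE = "10.128.0.0/9"


def find_overlaps(ranges_a: List[str], ranges_b: List[str]) -> List[Tuple[str, str]]:
    """Return every pair of CIDR ranges (one from each network) that overlap."""
    overlaps = []
    for a in ranges_a:
        for b in ranges_b:
            if ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b)):
                overlaps.append((a, b))
    return overlaps


# Hypothetical networks: one auto-mode, one custom-mode with hand-picked ranges.
auto_network = [AUTO_MODE_RANGE]
custom_network = ["10.10.0.0/24", "10.132.0.0/20"]

print(find_overlaps(auto_network, custom_network))
```

An empty result is the precondition for peering. Any pair in the list is a range you need to re-plan first.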

WEEK 04 Docker and Container Security [docker, oci, artifact-registry]

Containers are not virtual machines

A container shares the host kernel. Understanding that single fact explains every security property, every limitation, and every escape vector containers have.

Linux namespaces and cgroups: the primitives Docker wraps
Image layers, union filesystem, and why layer order matters for cache and size
Dockerfile best practices: multi-stage builds, non-root users, minimal base images
Container security: read-only filesystems, dropped capabilities, seccomp profiles
Artifact Registry: pushing, pulling, image vulnerability scanning with Artifact Analysis (formerly Container Analysis)
Container runtime threat model: what an attacker can and cannot do inside a container

What you will build

Containerise a small web application with deliberate security mistakes, then fix every one.

Write a Dockerfile that runs as root. Measure the image size. Then rewrite it: multi-stage build, non-root user, minimal base. Compare.
Push to GCP Artifact Registry. Enable vulnerability scanning. Fix the first CVE it reports.
Run the container with --read-only --cap-drop ALL --security-opt no-new-privileges. Debug why it crashes and fix the app, not the flags.
Deploy the container to Cloud Run and confirm it is reachable over HTTPS

Week Deliverable

Two Dockerfiles (before and after), a written security audit listing each original vulnerability and how it was addressed, and the live Cloud Run URL.

Decisions you will encounter in the real world

Distroless vs Alpine vs Ubuntu base images: distroless has the smallest attack surface but no shell for debugging; Alpine is small and has a shell; Ubuntu is familiar but large
One process per container vs multiple: one process makes health checks precise and restarts fast; multiple can simplify sidecar patterns at the cost of lifecycle coupling
Building in CI vs building locally: local builds are fast to iterate, CI builds are reproducible and auditable. Both are necessary.
Image tagging strategy: latest is ambiguous in production. Always tag by git SHA or semantic version.
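
One way to enforce the tagging rule is to build image references in code and reject anything that is not a commit SHA. This is a sketch under assumptions: the project, repository, and image names are hypothetical, and the path layout follows the Artifact Registry Docker naming scheme.

```python
import re


def image_reference(location: str, project: str, repository: str,
                    image: str, git_sha: str) -> str:
    """Return a full image reference tagged by git SHA; refuse mutable tags."""
    # Accept abbreviated (7+) or full (40) lowercase hex commit SHAs only.
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError(f"not a git SHA: {git_sha!r}")
    return f"{location}-docker.pkg.dev/{project}/{repository}/{image}:{git_sha}"


# Hypothetical values for illustration.
print(image_reference("europe-west1", "example-project", "apps", "web", "3f2a9c1"))
```

In a pipeline the SHA would come from the CI environment rather than a literal. The point is that "latest" never passes validation.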

WEEK 05 Infrastructure as Code with Terraform [terraform, opentofu, state]

Infrastructure that explains itself

Clicking through the console does not scale, does not survive staff turnover, and does not pass a security audit. IaC is the practice of treating infrastructure with the same engineering discipline as application code.

Declarative vs imperative: you declare the desired state, Terraform figures out the diff
Terraform core workflow: init, plan, apply, destroy
State: what it is, why it must be stored remotely, and what happens when it drifts
Modules: how to encapsulate reusable infrastructure patterns
Variables, outputs, and data sources
Locking state with GCS backend: preventing concurrent apply collisions
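
The declarative idea fits in a few lines: compare what you declared with what exists and derive the actions. This toy model is not Terraform's actual algorithm (it has no dependency graph, no providers, no state file), and the resource names are hypothetical.

```python
from typing import Dict, List


def plan(desired: Dict[str, dict], actual: Dict[str, dict]) -> Dict[str, List[str]]:
    """Diff desired state against actual state and list the actions needed."""
    return {
        # Declared but missing: must be created.
        "create": sorted(k for k in desired if k not in actual),
        # Present in both but with different settings: must be updated.
        "update": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        # Exists but no longer declared: must be destroyed.
        "delete": sorted(k for k in actual if k not in desired),
    }


# Hypothetical resources for illustration.
desired = {
    "vpc.main": {"mode": "custom"},
    "subnet.private": {"cidr": "10.10.0.0/24"},
    "bucket.site": {"class": "STANDARD"},
}
actual = {
    "vpc.main": {"mode": "custom"},
    "subnet.private": {"cidr": "10.10.1.0/24"},
    "vm.legacy": {"type": "e2-micro"},
}

print(plan(desired, actual))
```

Drift is the case where actual changes behind your back: the next plan shows actions you did not ask for, which is exactly what the state drift exercise demonstrates.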

What you will build

Recreate everything you built in Weeks 3 and 4 using Terraform. Delete the manually-created resources first. Your Terraform code is the only source of truth.

Write a Terraform module for VPC + subnets that accepts region and CIDR as variables
Store state in a GCS bucket with versioning enabled. Verify you can roll back state after an accidental terraform apply.
Run terraform plan before every apply in a CI-like loop and review the diff
Deliberately cause state drift by deleting a resource in the console. Run terraform plan -refresh-only (the replacement for the deprecated terraform refresh command) and document what happened.

Week Deliverable

A Terraform repo in GitHub with a VPC module, a Cloud Run module, and a README explaining how to deploy the full stack from scratch in one terraform apply.

Decisions you will encounter in the real world

Terraform vs Pulumi vs CDK: Terraform is the lingua franca; Pulumi and CDK let you use real programming languages but have smaller communities and less third-party module coverage
Monolithic root module vs small modules: large modules are simpler early on but become dangerous to apply as the plan grows; split early
Remote state locking: the GCS backend supports state locking natively, which is good enough for a small team; Terraform Cloud provides locking with a UI and team access controls
When to import existing resources: importing is the right answer when you cannot afford downtime to recreate; it adds complexity and should be cleaned up afterward

WEEK 06 CI/CD Pipelines and Automation [github-actions, cloud-build, gitops]

Every merge is a deployment decision

A CI/CD pipeline is not a deployment script. It is the automated enforcement of your quality and security policy. Every step is a gate that protects production from humans.

CI vs CD vs CD: continuous integration, continuous delivery, and continuous deployment and what distinguishes each
GitHub Actions: workflow syntax, triggers, jobs, steps, contexts, and secrets
Workload Identity Federation: why you do not store GCP service account keys as GitHub secrets
Pipeline stages: lint, test, build, scan, deploy, smoke test
GitOps: the repo as the single source of truth for cluster state
Rollback strategies: redeploy previous tag vs feature flags vs canary deployments

What you will build

Wire your Week 4 containerised app to a full GitHub Actions pipeline that deploys to Cloud Run on every push to main.

Configure Workload Identity Federation so the pipeline authenticates to GCP without any long-lived key
Add a job that runs container vulnerability scanning and fails the pipeline on CRITICAL severity CVEs
Add a manual approval step before production deployment using GitHub Environments
Simulate a bad deploy: push a broken image and practice rolling back to the previous revision in Cloud Run
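
The vulnerability gate comes down to one decision: given the scanner's findings, should this job exit non-zero? A minimal sketch follows. The findings format and the CVE identifiers are hypothetical placeholders; a real scanner's output needs to be mapped into this shape first.

```python
from typing import Dict, List

SEVERITY_RANK: Dict[str, int] = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def blocking_findings(findings: List[dict], fail_on: str = "CRITICAL") -> List[dict]:
    """Return the findings at or above the severity threshold."""
    threshold = SEVERITY_RANK[fail_on]
    # Unknown severities default to the threshold, so the gate fails closed.
    return [f for f in findings
            if SEVERITY_RANK.get(f["severity"], threshold) >= threshold]


def exit_code(findings: List[dict], fail_on: str = "CRITICAL") -> int:
    """Return 1 when the pipeline should stop, 0 when it may continue."""
    return 1 if blocking_findings(findings, fail_on) else 0


# Placeholder findings; the identifiers are not real CVEs.
sample = [
    {"id": "CVE-0000-0001", "severity": "LOW"},
    {"id": "CVE-0000-0002", "severity": "CRITICAL"},
]

for finding in blocking_findings(sample):
    print("blocking:", finding["id"], finding["severity"])
print("exit code:", exit_code(sample))
```

In a workflow step you would pass the result to sys.exit so the job fails. Treating unknown severities as blocking is the fail-closed choice; flip that default and you have fail open.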

Week Deliverable

A working pipeline with at least 4 stages, the Workload Identity Federation config documented, and a written incident report on the rollback exercise.

Decisions you will encounter in the real world

GitHub Actions vs Cloud Build vs Tekton: Actions is easy to start with and has the largest marketplace; Cloud Build is tightly integrated with GCP; Tekton runs in-cluster and is complex but portable
Environment-per-branch vs environment-per-PR: per-branch is simpler to manage; per-PR is more isolated but multiplies infrastructure cost
Fail open vs fail closed on security scans: fail closed blocks deployments on new CVEs, including CVEs in images you did not change. Plan for the false-positive rate.
Blue-green vs canary vs rolling: blue-green is safest to roll back; canary catches issues with a small blast radius; rolling is the default, and rolling back stops being clean once state changes such as schema migrations are involved

WEEK 07 Kubernetes Fundamentals [gke, kubectl, rbac]

The platform under the platform

Kubernetes is a container orchestrator but it is more useful to think of it as a declarative API for distributed systems. The control loop concept, where the system perpetually reconciles desired state with actual state, is the idea that everything else builds on.

Cluster architecture: control plane (API server, etcd, scheduler, controller-manager) vs worker nodes
Core objects: Pod, Deployment, Service, ConfigMap, Secret, Namespace, PersistentVolumeClaim
Scheduling: node selectors, affinity, taints and tolerations
Networking: ClusterIP vs NodePort vs LoadBalancer, Ingress, kube-dns
RBAC: Roles, ClusterRoles, RoleBindings, the relation to GKE Workload Identity
Resource requests and limits: what happens when a container exceeds memory vs CPU
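
The control loop from the start of this week can be modelled in a few lines. This is a toy, not how the Kubernetes controllers are implemented: pods are plain strings, there is no API server, and each pass takes exactly one corrective step.

```python
import itertools
from typing import List

_pod_ids = itertools.count(1)


def reconcile_once(desired: int, pods: List[str]) -> List[str]:
    """One pass: compare desired with actual and take a single corrective step."""
    if len(pods) < desired:
        return pods + [f"pod-{next(_pod_ids)}"]  # too few: start one
    if len(pods) > desired:
        return pods[:-1]                         # too many: stop one
    return pods                                  # converged: nothing to do


def reconcile(desired: int, pods: List[str], max_passes: int = 100) -> List[str]:
    """Repeat single passes until actual state matches desired state."""
    for _ in range(max_passes):
        updated = reconcile_once(desired, pods)
        if updated == pods:
            break
        pods = updated
    return pods


running = reconcile(3, [])
print(running)
# Simulate killing every pod; the loop restores the desired count.
print(reconcile(3, []))
```

Nothing in the loop cares why the pods disappeared. That indifference to cause is what makes the kill-all-pods exercise recover without any special handling.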

What you will build

Deploy your containerised application from Week 4 onto GKE Autopilot. Operate it: scale it, break it, and recover it.

Write Deployment and Service manifests. Deploy via kubectl apply. Verify with kubectl rollout status.
Configure horizontal pod autoscaler. Load-test with hey or k6 and watch pods scale. Watch them scale back down.
Deliberately kill all pods. Observe the ReplicaSet recreate them. Record how long recovery takes.
Set up RBAC so a read-only service account can describe pods but cannot exec into them
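
When you watch the autoscaler during the load test, it helps to know the arithmetic behind it. The sketch below follows the documented core formula, desired = ceil(current replicas x current metric / target metric), with a tolerance band around the target. It leaves out stabilisation windows, unready pods, and multiple metrics, and the default bounds are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Approximate the autoscaler's replica calculation for one metric."""
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# 2 pods at 90% CPU against a 60% target.
print(desired_replicas(2, 90, 60))
# 4 pods at 30% CPU against a 60% target.
print(desired_replicas(4, 30, 60))
```

Compare these numbers with what the load test report shows. A gap usually points to one of the behaviours this sketch leaves out, most often the scale-down stabilisation delay.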

Week Deliverable

All Kubernetes manifests in a dedicated k8s/ directory in your repo, a load test report showing autoscaler behaviour, and a written explanation of each RBAC binding and why it grants exactly that scope.

Decisions you will encounter in the real world

Autopilot vs Standard GKE: Autopilot manages nodes for you and is cheaper for intermittent workloads; Standard gives control over node pools, machine types, and GPU access
Helm vs raw manifests vs Kustomize: raw manifests are easiest to understand; Helm packages reusable charts but templating logic gets complex; Kustomize overlays without templating
Ingress vs Gateway API: Ingress is stable and understood; Gateway API is the successor and handles more routing patterns but tooling support varies
Namespaces for isolation: namespaces are soft boundaries, not hard security boundaries. Multi-tenant workloads with different trust levels need separate clusters.

WEEK 08 Observability, SRE and Capstone [prometheus, cloud-monitoring, slo]

You cannot improve what you cannot measure

Observability is not dashboards. It is the property of a system that lets you ask arbitrary questions about its internal state from external outputs. This week you wire up your full stack so nothing can fail silently.

The three pillars: metrics, logs, and traces and when each one answers a different class of question
SLI, SLO, and SLA: defining a service level indicator, writing a service level objective, and the error budget that follows from it
Cloud Monitoring: uptime checks, dashboards, alerting policies, notification channels
Cloud Logging: structured logs, log-based metrics, log sinks to BigQuery for analysis
Distributed tracing with Cloud Trace: how a trace spans multiple services
Incident management: the on-call rotation, incident commander role, post-mortem process

What you will build (Capstone)

Instrument the full stack you built across Weeks 1 to 7. Write SLOs. Break things intentionally and prove your alerting catches it before a user does.

Add structured JSON logging to your application. Create a log-based metric for error rate. Alert when error rate exceeds 1% over 5 minutes.
Write two SLOs: a latency SLO (95% of requests under 500ms) and an availability SLO (99.5% uptime). Configure error budget burn rate alerts.
Conduct a chaos experiment: terminate pods at random using a script. Confirm your dashboards show the event. Write a post-mortem with timeline, root cause, and action items.
Add your CI/CD pipeline deployment events as annotations on your dashboards. Correlate a past deployment with a latency spike.
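
For the structured logging step, the standard logging module is enough. This sketch emits one JSON object per line with severity and message fields, which Cloud Logging can parse from container output. The extra field names (status, latency_ms, path) and the logger name are assumptions for illustration.

```python
import json
import logging
import sys

# Hypothetical request fields copied into the log entry when present.
EXTRA_FIELDS = ("status", "latency_ms", "path")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        for key in EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger(sys.stdout)
log.error("upstream timeout", extra={"status": 502, "latency_ms": 1200, "path": "/checkout"})
```

With entries in this shape, a log-based metric for error rate becomes a filter on severity or status instead of a regex over free text.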

Final Capstone Deliverable

A public GitHub portfolio repo containing: Terraform code, Kubernetes manifests, CI/CD pipeline, application code, monitoring dashboards (exported JSON), two defined SLOs, and a capstone post-mortem. This is your first production-grade portfolio project.

Decisions you will encounter in the real world

Cloud-native observability vs self-managed: Cloud Monitoring is zero-ops but costs money and locks you to GCP; Prometheus/Grafana/Loki is portable and customisable but you own the operations
Structured logs vs unstructured: structured (JSON) logs are queryable; unstructured logs require regex and are painful at scale. Default to structured from the start.
Alerting on symptoms vs causes: alert on high latency and high error rate (symptoms your users feel), not on CPU and memory (causes you investigate after being paged)
SLO strictness: a 99.9% SLO gives you 43 minutes of allowed downtime per month. Every nine you add costs disproportionately in engineering and infrastructure spend.
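
The downtime figure above is plain arithmetic, and it is worth running for your own SLOs. This sketch assumes a 30-day window and defines burn rate as the observed error ratio divided by the error ratio the SLO allows; the function names are my own.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


for slo in (0.995, 0.999, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} minutes per 30 days")

# A 1% error ratio against a 99.9% SLO spends budget about ten times too fast.
print(f"burn rate: {burn_rate(0.01, 0.999):.1f}")
```

Each extra nine divides the budget by ten: roughly 43 minutes at 99.9% becomes about 4 minutes at 99.99%. That is where the disproportionate cost comes from.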

Learn Cloud, DevOps and Platform Engineering

Learning How to Think Like an Engineer

Why Before How

Break Things on Purpose

Document Everything

Real Infrastructure Only

Everything You Need Costs Nothing

Command Line Foundations

Cloud Fundamentals

Docker and Terraform

Orchestration

Operating at Scale

Stay Connected

Who Is Teaching You

Before You Apply

Cohort 1 is Full
