# Open Source Hyperscaler — Map of Content

> If you wanted to build a hyperscaler — something like AWS or GCP — but entirely on open source tooling, this is the playbook. Each layer below represents a real decision you'd need to make, and the tools listed are the ones that serious operators actually use in production. Nothing here is theoretical.

---

## Why This Matters

The big three cloud providers (AWS, Azure, GCP) are proprietary stacks top to bottom. But there's a parallel universe of open source projects that, when composed together, can replicate most of what a hyperscaler does. Companies like OVHcloud, CERN, Walmart, and dozens of telcos already run production clouds on these tools. The trick is knowing which pieces fit where and how they talk to each other.

---

## The Stack at a Glance

Think of it as seven layers, each one building on the last:

```
┌─────────────────────────────────────────────┐
│ 7. Tenant-Facing Services                   │
│    Billing · DNS · Load Balancing · CDN     │
├─────────────────────────────────────────────┤
│ 6. Security & Identity                      │
│    Auth · Secrets · Policy · Scanning       │
├─────────────────────────────────────────────┤
│ 5. Observability                            │
│    Metrics · Logs · Traces · Alerts         │
├─────────────────────────────────────────────┤
│ 4. Orchestration & Automation               │
│    Containers · GitOps · IaC · CI/CD        │
├─────────────────────────────────────────────┤
│ 3. Storage                                  │
│    Block · Object · File · NVMe             │
├─────────────────────────────────────────────┤
│ 2. Networking                               │
│    SDN · Routing · Overlay · Physical NOS   │
├─────────────────────────────────────────────┤
│ 1. Compute Foundation                       │
│    Bare Metal · Hypervisor · Cloud Platform │
└─────────────────────────────────────────────┘
```

---

[[Open Source Hyperscaler Stack.svg]]

## Layer 1 — Compute Foundation

This is where everything starts. You have physical servers and you need to turn them into something usable.
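To make the end state of this layer concrete before diving into the tools: its finished product is, roughly, a defined virtual machine. Here is a minimal sketch, in Python's standard library, of the kind of libvirt-style domain definition the tooling below ultimately produces. This is illustrative only: `make_domain_xml` is a hypothetical helper, and a real domain also needs disks, network interfaces, and a boot device.

```python
import xml.etree.ElementTree as ET


def make_domain_xml(name: str, memory_mib: int, vcpus: int) -> str:
    """Build a minimal libvirt-style <domain> definition.

    Trimmed far below what production needs; libvirt validates
    much more than this sketch generates.
    """
    domain = ET.Element("domain", type="kvm")
    ET.SubElement(domain, "name").text = name
    ET.SubElement(domain, "memory", unit="MiB").text = str(memory_mib)
    ET.SubElement(domain, "vcpu").text = str(vcpus)
    os_el = ET.SubElement(domain, "os")
    ET.SubElement(os_el, "type", arch="x86_64").text = "hvm"
    return ET.tostring(domain, encoding="unicode")


# A two-vCPU, 2 GiB tenant VM as a serialized definition.
xml = make_domain_xml("tenant-vm-01", 2048, 2)
```

Every layer below exists to take you from a pallet of servers to definitions like this one, applied safely, at scale, for thousands of tenants.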
### Bare Metal Provisioning

- **[[MAAS]]** (Metal as a Service) — Canonical's tool for discovering, commissioning, and deploying physical machines. It talks to your servers over IPMI/BMC, PXE boots them, and lays down an OS. This is your "server factory."
- **[[Tinkerbell]]** — A lighter alternative from Equinix Metal. Declarative provisioning workflows. Good if you want more control over the boot process.
- **[[OpenBMC]]** — Open source baseboard management controller firmware. Gives you visibility into hardware health, fan speeds, temperatures — the stuff that matters when you're running thousands of machines.

### Hypervisor

- **[[KVM]]** + **QEMU** — The foundation of nearly every open source cloud. KVM turns your Linux kernel into a hypervisor; QEMU provides the hardware emulation. Together they're what Nova (OpenStack) and every other orchestrator actually use to spin up VMs.
- **[[libvirt]]** — The management API that sits on top of KVM/QEMU. Think of it as the translator between your cloud platform and the hypervisor.

### Cloud Platform (IaaS)

This is the brain — the thing that takes all your bare metal and presents it as a coherent cloud.

- **[[OpenStack]]** — The big one. The platform behind many of the largest production clouds outside the big three. Its components map directly to cloud services:
  - **Nova** → Compute (like EC2)
  - **Neutron** → Networking (like VPC)
  - **Cinder** → Block storage (like EBS)
  - **Swift** → Object storage (like S3)
  - **Keystone** → Identity (like IAM)
  - **Horizon** → Dashboard (like the AWS Console)
  - **Heat** → Orchestration (like CloudFormation)
  - **Ironic** → Bare metal provisioning within OpenStack
- **[[OpenNebula]]** — A leaner, simpler alternative to OpenStack. Fewer moving parts, easier to operate with a smaller team. Strong at edge deployments and private cloud. If OpenStack is the enterprise option, OpenNebula is the pragmatist's choice.
- **[[Apache CloudStack]]** — Another mature IaaS platform.
Less popular in the West, but it has solid production deployments across Asia-Pacific.

> **Arrow of Progress:** MAAS provisions bare metal → KVM virtualizes it → OpenStack or OpenNebula orchestrates it into a cloud

---

## Layer 2 — Networking

A hyperscaler lives or dies on its network. You need both a physical underlay (actual switches and routers) and a virtual overlay (software-defined networking for tenants).

### Software-Defined Networking

- **[[Open vSwitch (OVS)]]** — The foundational virtual switch. Runs on every hypervisor host. Handles VLAN tagging, tunneling, and flow rules.
- **[[OVN]]** (Open Virtual Network) — Built on top of OVS. Adds distributed routing, load balancing, ACLs, and security groups. This is what OpenStack Neutron typically uses under the hood.
- **[[Cilium]]** — eBPF-based networking for containers. Extremely fast because it hooks directly into the kernel. Does network policy, load balancing, encryption, and observability all in one. The modern choice for Kubernetes networking.

### Physical Network

- **[[SONiC]]** — Originally from Microsoft, now a Linux Foundation project. It's an open source network operating system that runs on white-box switches (Broadcom, Mellanox). This is how you avoid vendor lock-in on your physical switching fabric.
- **[[FRRouting (FRR)]]** — Handles BGP, OSPF, IS-IS, and other routing protocols. This is your underlay routing engine — the thing that makes sure traffic actually gets from rack to rack.
- **[[Cumulus Linux]]** (now NVIDIA) — Open-ish network OS. It was more open once and is more commercial now, but it's worth knowing about.

> **Arrow of Progress:** FRR + SONiC handle physical routing → OVS provides virtual switching → OVN or Cilium create tenant-isolated overlay networks

---

## Layer 3 — Storage

You need three kinds of storage: block (for VM disks), object (for S3-style blob storage), and file (for shared filesystems).

- **[[Ceph]]** — The workhorse.
A single Ceph cluster gives you all three:
  - **RBD** → Block storage (backs Cinder in OpenStack)
  - **RadosGW** → S3-compatible object storage
  - **CephFS** → POSIX-compliant shared filesystem

  Ceph is what most production OpenStack deployments use. It's battle-tested at scale.

- **[[MinIO]]** — If you only need object storage, MinIO is simpler and faster than Ceph's RadosGW. S3-compatible, very easy to deploy.
- **[[Longhorn]]** — Persistent volume management for Kubernetes. Simpler than Ceph for container-native workloads.
- **[[SPDK]]** — Intel's Storage Performance Development Kit. For when you need NVMe-over-Fabrics and microsecond latency. More of a building block than a turnkey solution.

> **Arrow of Progress:** Ceph provides the unified storage plane → MinIO or RadosGW exposes the S3 API → Longhorn bridges into Kubernetes persistent volumes

---

## Layer 4 — Orchestration & Automation

Once you have compute, network, and storage, you need to orchestrate workloads and automate everything.

### Container Orchestration

- **[[Kubernetes]]** — Non-negotiable for a modern hyperscaler. Every major cloud offers managed Kubernetes, so you need it too.
- **[[KubeVirt]]** — Lets you run traditional VMs inside Kubernetes alongside containers. Useful for tenants who aren't fully containerized yet.
- **[[Rancher]]** — Multi-cluster Kubernetes management. One pane of glass across all your clusters.
- **[[OKD]]** — The open source upstream of Red Hat OpenShift. More opinionated than vanilla Kubernetes, which can be a good thing at scale.

### Infrastructure as Code

- **[[OpenTofu]]** — The fully open source fork of Terraform (after HashiCorp changed its license). Declarative infrastructure provisioning.
- **[[Ansible]]** — Configuration management and orchestration. The glue that ties everything together — installing packages, configuring services, rolling out updates across thousands of nodes.
- **[[Salt]]** — Event-driven automation.
Better than Ansible for real-time, reactive operations at hyperscaler scale.

### CI/CD & GitOps

- **[[ArgoCD]]** — GitOps operator for Kubernetes. Your Git repo is the source of truth; Argo keeps the clusters in sync.
- **[[Tekton]]** — Cloud-native CI/CD pipelines that run as Kubernetes resources.
- **[[Gitea]]** or **GitLab CE** — Self-hosted Git forges.

> **Arrow of Progress:** OpenTofu defines infrastructure → Ansible configures it → Kubernetes orchestrates workloads → ArgoCD keeps it all in sync with Git

---

## Layer 5 — Observability

You cannot run what you cannot see. At hyperscaler scale, observability is not optional — it's the immune system.

### Metrics

- **[[Prometheus]]** — Pull-based metrics collection. The standard for cloud-native monitoring.
- **[[Thanos]]** or **[[Mimir]]** — Long-term, multi-cluster Prometheus storage. Prometheus alone doesn't scale well past a single cluster; these fix that.
- **[[Grafana]]** — Dashboards and visualization. Connects to everything.

### Logs

- **[[Loki]]** — Log aggregation designed to work with Grafana. Lightweight: it indexes labels only, not full text, which makes it cheap to run.

### Traces

- **[[Jaeger]]** or **[[Tempo]]** — Distributed tracing for understanding request flows across microservices.
- **[[OpenTelemetry]]** — The unified collection layer. One SDK, one collector, feeding into Prometheus, Loki, Jaeger, or whatever else you use.

### Infrastructure Monitoring

- **[[Zabbix]]** — Hardware-level monitoring. IPMI sensors, disk health, network gear SNMP. The stuff Prometheus doesn't cover well.
- **[[Netdata]]** — Real-time system monitoring with zero configuration. Good for per-node visibility.

> **Arrow of Progress:** OpenTelemetry collects → Prometheus/Loki/Jaeger store → Grafana visualizes → Zabbix covers the hardware layer

---

## Layer 6 — Security & Identity

### Identity & Access Management

- **[[Keycloak]]** — Identity provider. SSO, OIDC, SAML, federation, user management.
This is your tenant-facing login system.
- **OpenStack Keystone** — IaaS-level identity. Service catalog, token management, project scoping.

### Secrets Management

- **[[OpenBao]]** — The fully open source fork of HashiCorp Vault (after the BSL license change). Dynamic secrets, encryption as a service, PKI.
- **[[cert-manager]]** — Automated TLS certificate management for Kubernetes. Integrates with Let's Encrypt.

### Runtime Security & Policy

- **[[Falco]]** — Runtime security for containers. Detects anomalous behavior (unexpected shell access, privilege escalation, unusual network activity).
- **[[Trivy]]** — Vulnerability scanning for container images, filesystems, and IaC.
- **[[Open Policy Agent (OPA)]]** — Policy-as-code engine. Enforces rules across admission control, API authorization, and more.

### Multi-Tenancy

- **[[Capsule]]** or **[[vCluster]]** — Kubernetes-native multi-tenancy. Capsule uses namespaces with policy guardrails; vCluster creates lightweight virtual clusters.

> **Arrow of Progress:** Keycloak authenticates tenants → OPA enforces policy → Falco monitors runtime → OpenBao manages secrets

---

## Layer 7 — Tenant-Facing Services

The last mile — what your customers actually interact with.

### DNS

- **[[PowerDNS]]** — Authoritative DNS server. Supports API-driven record management, which you need for automated cloud operations.
- **[[CoreDNS]]** — Kubernetes-native DNS. Handles internal service discovery.

### Load Balancing & Ingress

- **[[HAProxy]]** — High-performance TCP/HTTP load balancer. Battle-proven at massive scale.
- **[[Envoy]]** — Modern L4/L7 proxy. The foundation of most service meshes (Istio, etc.).
- **[[MetalLB]]** — Bare-metal load balancer for Kubernetes. Gives you LoadBalancer-type services without a cloud provider.
- **[[Traefik]]** — Kubernetes ingress controller. Auto-discovers services and routes traffic.

### Billing & Metering

This is the hardest gap in open source. No single project does billing well.
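Whatever you use to collect usage, the rating step itself is small enough to sketch. The following is a minimal homegrown sketch, not any project's actual API: `RATE_CARD` and `invoice` are hypothetical names, and the prices are made up. It multiplies measured usage by a price and sums per tenant, which is conceptually what a rating engine does.

```python
from collections import defaultdict

# Hypothetical rate card: price per unit of each metered resource.
RATE_CARD = {
    "vcpu_hours": 0.02,          # dollars per vCPU-hour
    "gib_storage_hours": 0.0001,  # dollars per GiB-hour of block storage
    "egress_gib": 0.05,          # dollars per GiB of egress
}


def invoice(usage_records, rate_card=RATE_CARD):
    """Aggregate (tenant, metric, quantity) records into per-tenant totals.

    Unknown metrics are rated at zero rather than rejected, a deliberate
    (and debatable) simplification for this sketch.
    """
    totals = defaultdict(float)
    for tenant, metric, quantity in usage_records:
        totals[tenant] += quantity * rate_card.get(metric, 0.0)
    return dict(totals)


records = [
    ("acme", "vcpu_hours", 1488.0),  # 2 vCPUs running for a 31-day month
    ("acme", "egress_gib", 12.0),
    ("globex", "vcpu_hours", 744.0),
]
bills = invoice(records)  # per-tenant totals in dollars
```

The hard parts in production are upstream of this loop: collecting trustworthy quantities, deduplicating them, and handling disputes. The tools below attack pieces of that problem.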
- **[[CloudKitty]]** — OpenStack's rating and chargeback module. Collects usage data and applies pricing rules.
- **[[Koku]]** — Cost management from Red Hat. More focused on cost visibility than invoicing.
- In practice, you'll likely build custom metering on top of Prometheus metrics and pipe it into CloudKitty or a homegrown billing system.

> **Arrow of Progress:** PowerDNS resolves → HAProxy/Envoy routes → CloudKitty meters → tenants get invoiced

---

## The Minimum Viable Hyperscaler

If you're building this tomorrow with a small team, here's the shortest path to something real:

```
MAAS → OpenStack (KVM) → Ceph → OVN/OVS → Kubernetes
  → Prometheus/Grafana → Keycloak → Ansible/OpenTofu
  → HAProxy → PowerDNS → CloudKitty
```

That gives you bare metal provisioning, IaaS, unified storage, overlay networking, container orchestration, monitoring, identity, automation, load balancing, DNS, and basic billing. Everything else layers on as you grow.

---

## What's Missing from Open Source

Honest gaps where you'd need to build custom or go commercial:

- **Billing & invoicing** — CloudKitty is a start but not a full billing platform
- **Marketplace / service catalog** — No good open source equivalent of the AWS Marketplace
- **Global traffic management** — Anycast + GeoDNS at global scale usually needs custom work
- **Managed database services** — You can run PostgreSQL, MySQL, etc., but the "managed" wrapper (automated backups, scaling, failover) is custom engineering
- **Support portals & SLA management** — Operational, not technical, but still needed

---

## Related Notes

- [[Data Center MoC]]
- [[Navon MoC]]
- [[Bare Metal]]
- [[VMs]]
- [[Docker Containers]]
- [[Clustering]]
- [[Scheduling]]
- [[MIGs]]

---

*This MoC is a living document. As open source tooling evolves — and it moves fast — the specific recommendations here will shift. The layers won't.*