**Clustering** is the practice of connecting multiple computers (nodes) so they operate as a single unified system — pooling compute, memory, and storage resources to handle workloads that exceed what any single machine can deliver. In data center contexts, clusters are the fundamental unit of scale for AI training, HPC, and large-scale inference.
---
### First Principle: No single machine is big enough for the hardest problems.
Modern AI models require more memory, compute, and bandwidth than any single server provides. Clustering solves this by distributing work across many nodes connected by high-speed networks, making the cluster behave as one large virtual computer.
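The arithmetic behind this principle can be sketched in a few lines. The figures below (model state size, GPUs per node, memory per GPU) are illustrative assumptions, not vendor specs:

```python
# Back-of-the-envelope sketch: why a single node is not enough.
# All figures below are illustrative assumptions, not vendor specs.

def nodes_required(model_memory_gb: float, gpus_per_node: int, gpu_memory_gb: float) -> int:
    """Minimum number of nodes needed just to hold the training state in GPU memory."""
    node_memory_gb = gpus_per_node * gpu_memory_gb
    # Ceiling division: you cannot provision a partial node.
    return -(-int(model_memory_gb) // int(node_memory_gb))

# Example: ~1.4 TB of training state (weights + optimizer + activations)
# on nodes with 8 GPUs x 80 GB each (640 GB of GPU memory per node).
print(nodes_required(1400, 8, 80))  # 3 nodes at minimum
```

Memory is only the floor; hitting acceptable step times usually demands far more nodes than this minimum, which is what makes the interconnect below so important.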
---
### Key Considerations
- **Cluster Interconnect**: The network between nodes is the critical bottleneck. **InfiniBand** (400 Gb/s+) and **RoCE** (RDMA over Converged Ethernet) provide the low-latency, high-bandwidth fabric needed for distributed training. See [[NVMe Fabric]] and [[NIC - Network Interface Cards]].
- **Homogeneity vs Heterogeneity**: Training clusters are typically homogeneous (identical nodes) for predictable performance. Inference clusters can be heterogeneous, mixing GPU and CPU nodes based on workload.
- **Failure Domains**: In a cluster of hundreds of nodes, hardware failures are routine — not exceptional. The [[Scheduling|scheduler]] must detect failures and redistribute work automatically.
- **Shared Storage**: Clusters need shared or distributed storage (parallel filesystems like Lustre, GPFS, or object stores) so all nodes can access training data and checkpoints.
- **Scaling Laws**: Cluster performance does not scale linearly with node count. Communication overhead grows with cluster size, making network topology and [[The Data Center is the Computer - How the Network Shapes Performance|network design]] critical.
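The sub-linear scaling noted in the last bullet can be made concrete with a toy cost model. This sketch assumes a ring all-reduce (each node transmits roughly `2*(N-1)/N` of the gradient payload per step); the compute time, gradient size, and per-node bandwidth are illustrative assumptions:

```python
# Toy scaling model: per-step time = compute + communication.
# Assumes a ring all-reduce; all constants are illustrative assumptions.

def step_time_s(n_nodes: int, compute_s: float = 1.0,
                grad_gb: float = 10.0, bw_gb_s: float = 50.0) -> float:
    if n_nodes == 1:
        return compute_s                      # no communication needed
    comm_s = 2 * (n_nodes - 1) / n_nodes * grad_gb / bw_gb_s
    return compute_s / n_nodes + comm_s       # ideal compute split + all-reduce cost

def scaling_efficiency(n_nodes: int) -> float:
    """Speedup over one node, divided by node count (1.0 = perfectly linear)."""
    return step_time_s(1) / (n_nodes * step_time_s(n_nodes))

for n in (2, 8, 64):
    print(n, round(scaling_efficiency(n), 2))
```

Even in this crude model, efficiency falls steadily as nodes are added because the communication term barely shrinks while the compute term does, which is exactly why fabric bandwidth and topology dominate cluster design.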
---
### Actionable Insights
When designing a cluster for a [[Modular Data Center Design Principles|modular data center]], the network architecture matters as much as the compute hardware. A cluster of 64 GPUs connected by 100 GbE will dramatically underperform the same 64 GPUs connected by 400 Gb/s InfiniBand: the difference can be 2–5× in distributed training throughput. Budget for the interconnect accordingly, and design the physical layout (rack placement, cable runs) to minimise network hops between nodes in the same job.
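The fabric's impact on gradient synchronisation can be estimated directly. This sketch again assumes a ring all-reduce; the 10 GB gradient payload per step is an illustrative assumption:

```python
# Rough per-step gradient-sync time on the two fabrics mentioned above.
# Assumes a ring all-reduce; payload size is an illustrative assumption.

def allreduce_s(grad_gb: float, link_gbit_s: float, n_nodes: int = 8) -> float:
    """Ring all-reduce time: each node moves ~2*(N-1)/N of the payload."""
    payload_gbit = grad_gb * 8                     # GB -> gigabits
    return 2 * (n_nodes - 1) / n_nodes * payload_gbit / link_gbit_s

grad_gb = 10  # ~10 GB of gradients per step (assumption)
print(f"100 GbE:             {allreduce_s(grad_gb, 100):.2f} s/step")
print(f"400 Gb/s InfiniBand: {allreduce_s(grad_gb, 400):.2f} s/step")
```

The sync time scales inversely with link bandwidth, so the 4× faster fabric cuts the communication term by 4× per step; whether that translates into 2–5× end-to-end throughput depends on how communication-bound the workload is.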
---
### Cluster Architecture Pattern
```
        ┌─────────────────────────┐
        │      Scheduler /        │
        │      Orchestrator       │
        │    ([[Scheduling]])     │
        └────────────┬────────────┘
                     │
       ┌─────────────┼─────────────┐
       │             │             │
   ┌───┴───┐     ┌───┴───┐     ┌───┴───┐
   │Node 1 │     │Node 2 │     │Node N │
   │(GPU×8)│     │(GPU×8)│     │(GPU×8)│
   └───┬───┘     └───┬───┘     └───┬───┘
       │             │             │
       └─────────────┼─────────────┘
            High-Speed Fabric
           (InfiniBand / RoCE)
```
[[Scheduling]] | [[Bare Metal]] | [[Clusters vs Instances]] | [[NVMe Fabric]] | [[NIC - Network Interface Cards]] | [[The Data Center is the Computer - How the Network Shapes Performance]]