**MIG (Multi-Instance GPU)** is an NVIDIA technology (introduced with the Ampere architecture, first shipping on the A100) that allows a single physical GPU to be partitioned into up to seven independent instances — each with its own dedicated compute resources, memory, and memory bandwidth. Each MIG instance behaves like a smaller, fully isolated GPU.

---

### **First Principle: GPU utilisation is the most expensive metric to waste.**

A single A100 or H100 GPU costs thousands of dollars and draws 300–700 W of power. If a workload needs only 20% of that GPU's capacity, the remaining 80% sits idle unless the GPU can be partitioned. MIG turns one expensive GPU into multiple smaller, independently usable GPUs — dramatically improving [[multi-tenancy|utilisation]] in shared environments.

---

### Key Considerations

- **Hardware Partitioning**: Unlike software-based GPU sharing (MPS, time-slicing), MIG provides **hardware-level isolation**. Each instance has guaranteed compute (Streaming Multiprocessors), memory, and memory bandwidth. One instance cannot interfere with another — eliminating the [[Noisy Neighbour Problem|noisy neighbour problem]].
- **Instance Profiles**: An A100 (80GB) can be sliced into profiles like 1g.10gb, 2g.20gb, 3g.40gb, or 7g.80gb — where the first number is the count of compute slices and the second is the memory allocation. The operator chooses the partition scheme based on the workload mix.
- **Inference Sweet Spot**: MIG is primarily valuable for inference workloads, where individual requests need only a fraction of a full GPU. Training workloads typically need full GPUs (or multiple GPUs via [[Clustering|clusters]]).
- **Scheduler Integration**: [[Scheduling|Schedulers]] like Kubernetes (with the NVIDIA device plugin) and SLURM can allocate individual MIG instances to different [[Docker Containers|containers]] or jobs, treating each partition as a schedulable GPU resource.
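The right-sizing decision a MIG-aware allocator makes — match each workload to the smallest profile that fits — can be sketched in a few lines. The profile names and sizes below are the real A100 80GB profiles mentioned above; the `smallest_fitting_profile` helper itself is a hypothetical illustration, not any scheduler's actual API.

```python
# Illustrative sketch: choose the smallest A100 80GB MIG profile whose
# memory covers a workload's demand. Profile names follow NVIDIA's
# <compute slices>g.<memory>gb convention; the selection logic is assumed.
A100_80GB_PROFILES = {  # profile -> (compute slices out of 7, memory in GB)
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

def smallest_fitting_profile(mem_gb: float) -> str:
    """Return the smallest-memory profile that fits the demand."""
    # Stable sort by memory size; on the 40 GB tie, 3g.40gb (fewer
    # compute slices) is listed first and therefore wins.
    for name, (_, mem) in sorted(A100_80GB_PROFILES.items(),
                                 key=lambda kv: kv[1][1]):
        if mem >= mem_gb:
            return name
    raise ValueError(f"{mem_gb} GB exceeds the largest MIG profile")

print(smallest_fitting_profile(8))   # small model -> 1g.10gb
print(smallest_fitting_profile(35))  # mid-size model -> 3g.40gb
```

In practice the choice also has to respect compute-slice demand and the fixed set of partition layouts the GPU supports, but memory is usually the binding constraint for inference serving.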
---

### Actionable Insights

For [[Modular Data Center Design Principles|modular data centers]] serving mixed inference workloads, MIG is the key to economic viability. Without MIG, small inference jobs each consume an entire GPU — driving utilisation below 30%. With MIG, a single H100 can serve 3–7 independent inference workloads simultaneously, pushing utilisation above 70%. When designing the [[Scheduling|scheduling layer]], ensure the orchestrator supports MIG-aware allocation so that instances are right-sized to actual workload demand rather than over-provisioned.

---

### MIG in the Abstraction Stack

```
[[VLSI]] (transistors)
  → [[Bare Metal]] (physical server)
  → [[VMs]] (hardware virtualisation)
  → [[Docker Containers]] (OS-level virtualisation)
  → MIG (GPU partitioning)   ← you are here
```

MIG is the finest granularity of GPU resource allocation — the bottom of the **workload slicing** progression: Whole Server → VMs → Containers → MIG.

---

### MIG Partition Example (A100 80GB)

| Profile | Compute Slices | Memory | Use Case |
|---------|----------------|--------|----------|
| 7g.80gb | 7/7 (full GPU) | 80 GB | Training, large inference |
| 4g.40gb | 4/7 | 40 GB | Medium inference models |
| 2g.20gb | 2/7 | 20 GB | Small model serving |
| 1g.10gb | 1/7 | 10 GB | Lightweight inference, dev |

[[Docker Containers]] | [[VMs]] | [[Bare Metal]] | [[Scheduling]] | [[multi-tenancy]] | [[Noisy Neighbour Problem]] | [[Inference]]
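The utilisation argument in the Actionable Insights section can be sanity-checked with back-of-the-envelope arithmetic. The figures below (five small jobs, each keeping ~20% of a full GPU busy, each fitting one 1g slice) are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope check of the MIG utilisation argument.
# All workload figures are assumed for illustration.
jobs = 5                 # small inference workloads to place
busy_fraction = 0.20     # share of a full GPU each job keeps busy (assumed)

# Without MIG: each job monopolises a whole GPU, so 5 GPUs run at ~20%.
util_without_mig = busy_fraction

# With MIG: the jobs land on 5 of one GPU's 7 slices. Assuming each job
# saturates its slice, one GPU is ~71% utilised and 4 GPUs are freed up.
slices_per_gpu = 7
util_with_mig = jobs / slices_per_gpu

print(f"without MIG: {util_without_mig:.0%} utilisation across {jobs} GPUs")
print(f"with MIG:    {util_with_mig:.0%} utilisation on 1 GPU")
```

Under these assumptions the fleet goes from five GPUs at 20% to one GPU above 70% — consistent with the "below 30% / above 70%" figures cited above.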