**MIG (Multi-Instance GPU)** is an NVIDIA technology (introduced with the Ampere architecture, starting with the A100) that allows a single physical GPU to be partitioned into up to seven independent instances — each with its own dedicated compute resources, memory, and memory bandwidth. Each MIG instance behaves like a smaller, fully isolated GPU.
---
### **First Principle: GPU utilisation is the most expensive metric to waste.**
A single A100 or H100 GPU costs tens of thousands of dollars and draws 300–700 W of power. If a workload needs only 20% of that GPU's capacity, the remaining 80% sits idle unless the GPU can be partitioned. MIG turns one expensive GPU into multiple smaller, independently usable GPUs — dramatically improving [[multi-tenancy|utilisation]] in shared environments.
---
### Key Considerations
- **Hardware Partitioning**: Unlike software-based GPU sharing (MPS, time-slicing), MIG provides **hardware-level isolation**. Each instance has guaranteed compute (Streaming Multiprocessors), memory, and memory bandwidth. One instance cannot interfere with another — eliminating the [[Noisy Neighbour Problem|noisy neighbour problem]].
- **Instance Profiles**: An A100 (80GB) can be sliced into profiles such as 1g.10gb, 2g.20gb, 3g.40gb, 4g.40gb, or 7g.80gb — the prefix is the number of compute slices (out of seven) and the suffix is the memory allocation. The operator chooses the partition scheme based on the workload mix.
- **Inference Sweet Spot**: MIG is primarily valuable for inference workloads, where individual requests need only a fraction of a full GPU. Training workloads typically need full GPUs (or multiple GPUs via [[Clustering|clusters]]).
- **Scheduler Integration**: [[Scheduling|Schedulers]] like Kubernetes (with the NVIDIA device plugin) and SLURM can allocate individual MIG instances to different [[Docker Containers|containers]] or jobs, treating each partition as a schedulable GPU resource.
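The profile naming above lends itself to a quick feasibility check. The sketch below is a hypothetical helper (not an NVIDIA API) that parses profile names like `3g.40gb` and tests whether a requested set of profiles fits within one A100 80GB's budget of seven compute slices and 80 GB. Note that real MIG placement also enforces geometry constraints, so not every combination that passes this budget check is actually creatable on hardware.

```python
# Hypothetical partition-scheme checker for one A100 80GB.
# Profile names like "3g.40gb" encode compute slices ("3g") and memory ("40gb").

TOTAL_COMPUTE_SLICES = 7   # an A100 exposes seven compute slices
TOTAL_MEMORY_GB = 80       # A100 80GB variant

def parse_profile(profile: str) -> tuple[int, int]:
    """Split '3g.40gb' into (compute_slices, memory_gb)."""
    compute, memory = profile.split(".")
    return int(compute.rstrip("g")), int(memory.rstrip("gb"))

def fits(profiles: list[str]) -> bool:
    """True if the requested profiles stay within the slice and memory budget."""
    slices = sum(parse_profile(p)[0] for p in profiles)
    memory = sum(parse_profile(p)[1] for p in profiles)
    return slices <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_GB

fits(["3g.40gb", "2g.20gb", "2g.20gb"])  # 7 slices, 80 GB -> fits
fits(["4g.40gb", "4g.40gb"])             # 8 slices -> does not fit
```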
---
### Actionable Insights
For [[Modular Data Center Design Principles|modular data centers]] serving mixed inference workloads, MIG is the key to economic viability. Without MIG, small inference jobs each consume an entire GPU — driving utilisation below 30%. With MIG, a single H100 can serve 3–7 independent inference workloads simultaneously, pushing utilisation above 70%. When designing the [[Scheduling|scheduling layer]], ensure the orchestrator supports MIG-aware allocation so that instances are right-sized to actual workload demand rather than over-provisioned.
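The utilisation numbers above can be reproduced with back-of-envelope arithmetic. The sketch below uses illustrative figures (jobs that each need ~25% of a GPU, packed three per GPU via a 2g.20gb-style profile), not benchmarks; the function name and scenario are assumptions for this note.

```python
import math

def gpus_needed(jobs: int, jobs_per_gpu: int) -> int:
    """GPUs that must be provisioned to host all jobs."""
    return math.ceil(jobs / jobs_per_gpu)

def utilisation(jobs: int, fraction_per_job: float, jobs_per_gpu: int) -> float:
    """Useful work as a fraction of total provisioned GPU capacity."""
    return jobs * fraction_per_job / gpus_needed(jobs, jobs_per_gpu)

# Six inference jobs, each needing ~25% of a GPU:
utilisation(6, 0.25, jobs_per_gpu=1)  # no MIG: 6 GPUs -> 0.25 (25%)
utilisation(6, 0.25, jobs_per_gpu=3)  # MIG, 3 slices/GPU: 2 GPUs -> 0.75 (75%)
```

The same small jobs go from ~25% utilisation (one GPU each) to ~75% (three per GPU), which is the "below 30% / above 70%" contrast described above.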
---
### MIG in the Abstraction Stack
```
[[VLSI]] (transistors)
→ [[Bare Metal]] (physical server)
→ [[VMs]] (hardware virtualisation)
→ [[Docker Containers]] (OS-level virtualisation)
→ MIG (GPU partitioning) ← you are here
```
MIG is the finest granularity of GPU resource allocation — the bottom of the **workload slicing** progression: Whole Server → VMs → Containers → MIG.
---
### MIG Partition Example (A100 80GB)
| Profile | Compute Slices | Memory | Use Case |
|---------|---------------|--------|----------|
| 7g.80gb | 7/7 (full GPU) | 80 GB | Training, large inference |
| 4g.40gb | 4/7 | 40 GB | Medium inference models |
| 2g.20gb | 2/7 | 20 GB | Small model serving |
| 1g.10gb | 1/7 | 10 GB | Lightweight inference, dev |
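Right-sizing, as advised above, means picking the smallest profile that covers a workload's memory footprint. The helper below is a hypothetical sketch built from the table's profiles (it ignores compute requirements and activation/KV-cache headroom, which a real scheduler would also account for).

```python
# Hypothetical right-sizing helper using the A100 80GB profiles from the table.
PROFILES = {          # profile name -> (compute slices, memory GB)
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

def smallest_profile(model_memory_gb: float) -> str:
    """Return the smallest MIG profile whose memory covers the model."""
    for name, (_, mem) in sorted(PROFILES.items(), key=lambda kv: kv[1][1]):
        if mem >= model_memory_gb:
            return name
    raise ValueError("model does not fit on a single GPU")

smallest_profile(14)  # -> "2g.20gb"
smallest_profile(70)  # -> "7g.80gb"
```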
---
**Related**: [[Docker Containers]] | [[VMs]] | [[Bare Metal]] | [[Scheduling]] | [[multi-tenancy]] | [[Noisy Neighbour Problem]] | [[Inference]]