I've been going down a rabbit hole on [[Inference]] infrastructure lately and there's something interesting happening with how the market's splitting up. It maps pretty directly onto the "utilization economics" idea and the [[Puzzle of low data center utilisation]] I've been thinking about.
## How the market's organizing itself
The infrastructure layer is basically bifurcating:
1. **Reserved compute** (CoreWeave, Lambda, NEBIUS, Crusoe) - this is your classic hourly GPU rental. You get predictable capacity, full control, and deterministic performance. Fewer customers, but bigger contracts.
2. **Inference platforms** (Modal, baseten) - the abstraction layer sitting between raw metal and APIs: you bring a model or container, they handle provisioning, scaling, and serving.
3. **Marketplace plays** (SF Compute Company, vast.ai) - matching supply and demand for reserved capacity.
4. **Inference APIs** - this is where it gets crowded. You've got Fireworks AI, together.ai, deepinfra, plus fal doing the modality-focused thing (generative media rather than text). And obviously the hyperscalers: Vertex AI, Bedrock, Azure AI Studio.
![[Screenshot 2025-12-17 at 22.35.01.png]]
## The non-obvious bit: utilization matters more than raw speed
I strongly suspect that **utilization variance** across providers actually matters more than **performance variance**.
Most systems-level optimizations (kernel tuning, batching, KV cache tricks) spread through open source pretty quickly. What _doesn't_ spread is the **operational ability** to keep GPUs busy by aggregating lots of different workloads.
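A toy way to see why that aggregation ability is the hard-to-copy part: if every tenant provisions dedicated GPUs for its own peak, most of those GPUs sit idle, while one pool sized to the aggregate peak runs much hotter. This is a minimal statistical-multiplexing sketch with a made-up demand distribution, not anyone's real traffic:

```python
# Toy statistical-multiplexing sketch. Demand numbers are invented; the point
# is that bursty per-tenant demand pools into a much smoother aggregate.
import random

random.seed(0)
TENANTS, HOURS = 50, 1_000

# Hourly GPU demand per tenant: usually low, occasionally spiking.
demand = [[random.choice([1, 1, 2, 2, 3, 8]) for _ in range(HOURS)]
          for _ in range(TENANTS)]

# Dedicated: each tenant provisions enough GPUs for its own peak hour.
dedicated_capacity = sum(max(d) for d in demand)
dedicated_util = sum(sum(d) for d in demand) / (dedicated_capacity * HOURS)

# Pooled: one fleet sized to the aggregate peak across all tenants.
aggregate = [sum(d[h] for d in demand) for h in range(HOURS)]
pooled_util = sum(aggregate) / (max(aggregate) * HOURS)

print(f"dedicated per-tenant fleets: {dedicated_util:.0%} average utilization")
print(f"pooled fleet:                {pooled_util:.0%} average utilization")
```

With these assumed numbers the dedicated fleets hover in the mid-30s percent while the pooled fleet lands around the 70s, and the gap only comes from operating scale, not from any kernel trick.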
Which is why:
- The fastest inference API is rarely the cheapest when you actually run production workloads
- A platform running at 75-80% utilization will beat a "faster" system stuck at 45-50% on cost per token (rough arithmetic in the sketch after this list)
- Reserved compute and inference APIs are both viable - just solving for completely different customer problems
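To put rough numbers on that second bullet: cost per token is roughly hourly GPU cost divided by tokens actually served per hour, so utilization sits directly in the denominator. The $/GPU-hour and tokens/sec figures below are hypothetical, chosen only to illustrate the comparison:

```python
# Illustrative only: effective cost per 1M tokens as a function of
# GPU price, peak throughput, and average utilization. All numbers invented.

def cost_per_million_tokens(gpu_cost_per_hour: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """$ per 1M tokens = hourly cost / tokens actually served per hour."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# "Faster" system, but the GPUs sit idle more than half the time.
fast_but_idle = cost_per_million_tokens(
    gpu_cost_per_hour=3.00, peak_tokens_per_sec=12_000, utilization=0.45)

# Slower per-request system, but the platform keeps it busy.
slower_but_busy = cost_per_million_tokens(
    gpu_cost_per_hour=3.00, peak_tokens_per_sec=9_000, utilization=0.78)

print(f"fast but idle:   ${fast_but_idle:.3f} / 1M tokens")
print(f"slower but busy: ${slower_but_busy:.3f} / 1M tokens")
```

Under those assumptions the slower system comes out around 25% cheaper per token, despite serving 25% fewer tokens per second at peak.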
![[Screenshot 2025-12-17 at 22.34.13.png]]
## So what?
![[Screenshot 2025-12-17 at 22.34.04.png]]
The durable edge goes to platforms that can offer both rails: reserved capacity for the workloads that need to be pinned, plus API/burst capacity for everything else. High utilization across both.
That's the thesis behind a sovereign AI factory model. **Time-to-power** is the moat. High-density liquid cooling. Enterprise and public-sector tenancy built in from the start, not bolted on later.
The trick is sustaining those utilization levels across both types of demand. That's what's actually defensible.
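A minimal sketch of that blended model, again with invented fleet sizes, contract utilization, and demand curve: a reserved baseload pins part of the fleet at a high, steady busy fraction, and bursty API demand is scheduled into the remaining headroom, lifting fleet-wide utilization well above what the reserved rail delivers alone.

```python
# Hypothetical two-rail fleet: a reserved baseload pins part of the fleet,
# bursty API traffic fills whatever headroom is left. All numbers invented.
FLEET_GPUS = 1_000
RESERVED_GPUS = 600        # pinned, long-term contracts
RESERVED_UTIL = 0.90       # assumed steady-state busy fraction

# Assumed hourly burst/API demand over one day, in GPU-equivalents.
burst_demand = [120, 100, 80, 70, 70, 90, 150, 230, 320, 380, 420, 450,
                460, 450, 430, 400, 380, 350, 300, 260, 220, 190, 160, 140]

hours = len(burst_demand)
headroom = FLEET_GPUS - RESERVED_GPUS
served_burst = sum(min(d, headroom) for d in burst_demand)  # excess is queued or turned away
reserved_gpu_hours = RESERVED_GPUS * RESERVED_UTIL * hours

reserved_only_util = reserved_gpu_hours / (FLEET_GPUS * hours)
blended_util = (reserved_gpu_hours + served_burst) / (FLEET_GPUS * hours)

print(f"reserved rail alone:      {reserved_only_util:.0%} fleet utilization")
print(f"reserved + burst blended: {blended_util:.0%} fleet utilization")
```

With these assumed numbers the reserved rail alone leaves the fleet at roughly half utilization, and layering burst traffic into the headroom pushes it toward 80% - which is the whole point of running both rails on one fleet.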
----