I've been going down a rabbit hole on [[Inference]] infrastructure lately and there's something interesting happening with how the market's splitting up. It maps pretty directly onto the "utilization economics" idea and the [[Puzzle of low data center utilisation]] I've been thinking about.
## How the market's organizing itself
The infrastructure layer is basically bifurcating:
1. **Reserved compute** (CoreWeave, Lambda, NEBIUS, Crusoe) - this is your classic hourly GPU rental. You get predictable capacity, full control, and deterministic performance. Fewer customers, but bigger contracts.
2. **Inference platforms** (Modal, baseten) - the abstraction layer sitting between raw metal and APIs: you bring a model or container, they handle provisioning, scaling, and serving.
3. **Marketplace plays** (SF Compute Company, vast.ai) - matching supply and demand for reserved capacity.
4. **Inference APIs** - this is where it gets crowded. You've got Fireworks AI, together.ai, deepinfra, plus fal doing the modality-focused thing (generative media rather than text). And obviously the hyperscalers: Vertex AI, Bedrock, Azure AI Studio.
![[Screenshot 2025-12-17 at 22.35.01.png]]
## The non-obvious bit: utilization matters more than raw speed
I strongly suspect that **utilization variance** across providers actually matters more than **performance variance**.
Most systems-level optimizations (kernel tuning, batching, KV cache tricks) spread through open source pretty quickly. What _doesn't_ spread is the **operational ability** to keep GPUs busy by aggregating lots of different workloads.
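A toy way to see why that aggregation ability is the hard-to-copy part: if every tenant provisions dedicated GPUs for its own peak, most of those GPUs sit idle, while one pool sized to the aggregate peak runs much hotter. This is a minimal statistical-multiplexing sketch with a made-up demand distribution, not anyone's real traffic:

```python
# Toy statistical-multiplexing sketch. Demand numbers are invented; the point
# is that bursty per-tenant demand pools into a much smoother aggregate.
import random

random.seed(0)
TENANTS, HOURS = 50, 1_000

# Hourly GPU demand per tenant: usually low, occasionally spiking.
demand = [[random.choice([1, 1, 2, 2, 3, 8]) for _ in range(HOURS)]
          for _ in range(TENANTS)]

# Dedicated: each tenant provisions enough GPUs for its own peak hour.
dedicated_capacity = sum(max(d) for d in demand)
dedicated_util = sum(sum(d) for d in demand) / (dedicated_capacity * HOURS)

# Pooled: one fleet sized to the aggregate peak across all tenants.
aggregate = [sum(d[h] for d in demand) for h in range(HOURS)]
pooled_util = sum(aggregate) / (max(aggregate) * HOURS)

print(f"dedicated per-tenant fleets: {dedicated_util:.0%} average utilization")
print(f"pooled fleet:                {pooled_util:.0%} average utilization")
```

With these assumed numbers the dedicated fleets hover in the mid-30s percent while the pooled fleet lands around the 70s, and the gap only comes from operating scale, not from any kernel trick.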
Which is why:
- The fastest inference API is rarely the cheapest when you actually run production workloads
- A platform running at 75-80% utilization will beat a "faster" system stuck at 45-50% on cost per token (rough arithmetic in the sketch after this list)
- Reserved compute and inference APIs are both viable - just solving for completely different customer problems
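To put rough numbers on that second bullet: cost per token is roughly hourly GPU cost divided by tokens actually served per hour, so utilization sits directly in the denominator. The $/GPU-hour and tokens/sec figures below are hypothetical, chosen only to illustrate the comparison:

```python
# Illustrative only: effective cost per 1M tokens as a function of
# GPU price, peak throughput, and average utilization. All numbers invented.

def cost_per_million_tokens(gpu_cost_per_hour: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """$ per 1M tokens = hourly cost / tokens actually served per hour."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# "Faster" system, but the GPUs sit idle more than half the time.
fast_but_idle = cost_per_million_tokens(
    gpu_cost_per_hour=3.00, peak_tokens_per_sec=12_000, utilization=0.45)

# Slower per-request system, but the platform keeps it busy.
slower_but_busy = cost_per_million_tokens(
    gpu_cost_per_hour=3.00, peak_tokens_per_sec=9_000, utilization=0.78)

print(f"fast but idle:   ${fast_but_idle:.3f} / 1M tokens")
print(f"slower but busy: ${slower_but_busy:.3f} / 1M tokens")
```

Under those assumptions the slower system comes out around 25% cheaper per token, despite serving 25% fewer tokens per second at peak.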
![[Screenshot 2025-12-17 at 22.34.13.png]]
## So what?
![[Screenshot 2025-12-17 at 22.34.04.png]]
The durable edge goes to platforms that can offer both rails: reserved capacity for the workloads that need to be pinned, plus API/burst capacity for everything else. High utilization across both.
That's the thesis behind a sovereign AI factory model. **Time-to-power** is the moat. High-density liquid cooling. Enterprise and public-sector tenancy built in from the start, not bolted on later.
The trick is sustaining those utilization levels across both types of demand. That's what's actually defensible.
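A minimal sketch of that blended model, again with invented fleet sizes, contract utilization, and demand curve: a reserved baseload pins part of the fleet at a high, steady busy fraction, and bursty API demand is scheduled into the remaining headroom, lifting fleet-wide utilization well above what the reserved rail delivers alone.

```python
# Hypothetical two-rail fleet: a reserved baseload pins part of the fleet,
# bursty API traffic fills whatever headroom is left. All numbers invented.
FLEET_GPUS = 1_000
RESERVED_GPUS = 600        # pinned, long-term contracts
RESERVED_UTIL = 0.90       # assumed steady-state busy fraction

# Assumed hourly burst/API demand over one day, in GPU-equivalents.
burst_demand = [120, 100, 80, 70, 70, 90, 150, 230, 320, 380, 420, 450,
                460, 450, 430, 400, 380, 350, 300, 260, 220, 190, 160, 140]

hours = len(burst_demand)
headroom = FLEET_GPUS - RESERVED_GPUS
served_burst = sum(min(d, headroom) for d in burst_demand)  # excess is queued or turned away
reserved_gpu_hours = RESERVED_GPUS * RESERVED_UTIL * hours

reserved_only_util = reserved_gpu_hours / (FLEET_GPUS * hours)
blended_util = (reserved_gpu_hours + served_burst) / (FLEET_GPUS * hours)

print(f"reserved rail alone:      {reserved_only_util:.0%} fleet utilization")
print(f"reserved + burst blended: {blended_util:.0%} fleet utilization")
```

With these assumed numbers the reserved rail alone leaves the fleet at roughly half utilization, and layering burst traffic into the headroom pushes it toward 80% - which is the whole point of running both rails on one fleet.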
----