[[Inference]] is overtaking training as the dominant AI compute workload. McKinsey projects that inference will account for the majority of AI compute by the end of the decade, even as per-token costs keep falling. The economics shift from episodic capex (training runs) to recurring opex that scales linearly with adoption.
This changes everything about how you think about [[AI Inference Infrastructure]].
## Training vs. Inference Economics
Training is a capex event. You spend big, once, to produce a model. The cost is front-loaded and predictable. It's a project.
Inference is opex. Every user query, every agent call, every API hit costs money. The spend compounds as adoption grows. It's a utility bill. And unlike training, inference demand is bursty, unpredictable, and driven by end-user behaviour you can't forecast.
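The capex-vs-opex contrast can be made concrete with a back-of-the-envelope sketch. All numbers below are illustrative assumptions (training cost, blended per-token price, tokens per query), not figures from any vendor; the point is the shape of the curve, not the values.

```python
# Illustrative only: hypothetical numbers to show how recurring inference
# spend compounds with adoption until it dwarfs a one-time training outlay.

TRAINING_CAPEX = 50_000_000     # one-time training run cost ($), assumed
COST_PER_1K_TOKENS = 0.002      # blended inference price ($), assumed
TOKENS_PER_QUERY = 1_000        # average tokens per request, assumed

def monthly_inference_opex(queries_per_day: float) -> float:
    """Recurring monthly inference spend at a given daily query volume."""
    daily = queries_per_day * (TOKENS_PER_QUERY / 1_000) * COST_PER_1K_TOKENS
    return daily * 30

for qpd in (1e6, 10e6, 100e6):
    opex = monthly_inference_opex(qpd)
    months_to_match = TRAINING_CAPEX / opex
    print(f"{qpd:>12,.0f} queries/day -> ${opex:,.0f}/month "
          f"(equals training capex after {months_to_match:,.1f} months)")
```

At low volume the training run dominates; at 100M queries/day the recurring bill matches the entire training spend in under a year. That is the "utility bill" dynamic.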
This is why the framing in [[The infrastructure layer and AI capex]] misses the second act. The capex surge builds training capacity. But the recurring cost that actually scales with enterprise AI adoption is inference. As more workflows, teams, and products incorporate models, inference becomes the dominant line item.
## Why Shared Infrastructure Breaks
Most enterprises start with multi-tenant inference APIs. Fast, cheap, zero operational overhead. Then they hit the [[Noisy Neighbour Problem]]: rate limits during peak usage, unpredictable latency, no configurability, compliance gaps. Shared infrastructure optimizes for the platform's economics, not yours.
Self-hosting solves the control problem but creates an operational nightmare. Provisioning GPUs, managing upgrades, handling scaling, maintaining availability. Capital and talent intensive. The [[Convenience-Control Tradeoff]] bites hard.
The market response: managed isolation. Dedicated, logically isolated inference infrastructure operated by the model vendor. You own the control plane (agent logic, workflows, data). They own the data plane (GPU provisioning, model serving, scaling). No shared resources, no contention, and no operational burden either. The convenience of [[multi-tenancy]] without the multi-tenant downsides.
## Agentic Workloads Make This Harder
Agentic AI workloads are bursty, multifaceted, and unpredictable by design. An agent chains multiple model calls, hits tools, waits, fires again. Traditional capacity planning assumes smooth demand curves. Agents produce spiky, irregular ones.
You can't pre-provision for agentic inference the way you could for batch training or steady-state API calls. Elastic, on-demand inference becomes table stakes. This is exactly the [[Puzzle of low data center utilisation]] playing out at the inference layer: facilities designed for peak capacity sit underutilized most of the time, and the utilization variance problem from my [[AI Inference Infrastructure]] analysis gets worse with agentic patterns.
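A toy simulation makes the capacity-planning problem visible. The traffic shapes below are invented (burst probability, spike magnitude, jitter ranges are all assumptions), but they capture the qualitative difference: agent runs fan out into spikes of chained calls, then go quiet while waiting on tools.

```python
import random

random.seed(42)

def steady_traffic(minutes: int, rate: float) -> list[float]:
    """Smooth demand: small jitter around a constant request rate."""
    return [rate * random.uniform(0.9, 1.1) for _ in range(minutes)]

def agentic_traffic(minutes: int, rate: float) -> list[float]:
    """Bursty demand: occasional spikes of chained model calls,
    long quiet stretches in between. Parameters are illustrative."""
    load = []
    for _ in range(minutes):
        if random.random() < 0.1:                 # occasional agent burst
            load.append(rate * random.uniform(5, 15))
        else:                                      # idle / waiting on tools
            load.append(rate * random.uniform(0.1, 0.5))
    return load

def peak_to_mean(load: list[float]) -> float:
    """Ratio of peak demand to average demand over the window."""
    return max(load) / (sum(load) / len(load))

steady = steady_traffic(24 * 60, rate=100)
agentic = agentic_traffic(24 * 60, rate=100)
print(f"steady  peak/mean: {peak_to_mean(steady):.2f}")
print(f"agentic peak/mean: {peak_to_mean(agentic):.2f}")
```

Capacity provisioned for the peak of the agentic curve sits idle most of the day, which is the utilization variance problem restated: the higher the peak-to-mean ratio, the worse the economics of pre-provisioning.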
## Inference is Not Your Differentiator
For most enterprises, model serving is undifferentiated heavy lifting. Your edge is in agent logic, workflow orchestration, data pipelines, domain expertise. The inference layer wants to become a commodity utility, like compute and storage before it.
This has two investment implications:
1. The durable value in inference accrues to platforms that can aggregate workloads and maintain high utilization across both reserved and burst capacity. Utilization economics matter more than raw speed, exactly as noted in [[AI Inference Infrastructure]].
2. The real opportunity is in the layers above inference: the orchestration, observability, and tooling that turns raw model access into production AI systems. [[Data Center Software]] for the inference era.
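The first point, that utilization economics beat raw speed, reduces to simple arithmetic. The sketch below uses made-up GPU prices and throughput figures; the relationship it demonstrates holds regardless of the specific numbers.

```python
def cost_per_million_tokens(gpu_hour_cost: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective serving cost: hourly GPU cost spread over tokens actually
    served, which scales with utilization, not peak throughput."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hour_cost / tokens_per_hour * 1_000_000

# Illustrative numbers: a slower chip kept busy beats a faster chip sitting idle.
fast_idle = cost_per_million_tokens(gpu_hour_cost=4.0,
                                    peak_tokens_per_sec=10_000,
                                    utilization=0.15)
slow_busy = cost_per_million_tokens(gpu_hour_cost=4.0,
                                    peak_tokens_per_sec=5_000,
                                    utilization=0.70)
print(f"fast chip at 15% utilization: ${fast_idle:.3f} per 1M tokens")
print(f"half-speed chip at 70% util:  ${slow_busy:.3f} per 1M tokens")
```

This is why platforms that can aggregate many bursty workloads, smoothing aggregate demand and keeping utilization high, capture the durable margin.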
## The Observability Gap
As inference scales, teams need visibility into request patterns, latency distributions, token throughput, and resource utilization. You can't optimize what you can't measure. This mirrors the broader shift toward the [[Data Center Master Metrics]] mindset, applied at the inference layer. Just as [[Power Usage Effectiveness - PUE]] became the standard metric for data center efficiency, inference needs its own operational metrics stack.
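A minimal sketch of what that metrics stack computes, assuming a hypothetical request log of (latency, tokens) pairs; the log format, the lognormal latency shape, and the 10-minute window are all invented for illustration.

```python
import random

random.seed(7)

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile over a sample of latencies."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

# Hypothetical request log: (latency_seconds, tokens_generated) per request.
requests = [(random.lognormvariate(-1.0, 0.6), random.randint(50, 800))
            for _ in range(10_000)]

latencies = [r[0] for r in requests]
total_tokens = sum(r[1] for r in requests)
wall_clock_seconds = 600  # assume the log covers a 10-minute window

print(f"p50 latency: {percentile(latencies, 50):.3f}s")
print(f"p95 latency: {percentile(latencies, 95):.3f}s")
print(f"p99 latency: {percentile(latencies, 99):.3f}s")
print(f"token throughput: {total_tokens / wall_clock_seconds:,.0f} tokens/s")
```

The tail percentiles matter more than the median here: agentic chains multiply per-call latency, so a bad p99 on one hop becomes a bad p50 for the whole workflow.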
## Connecting the Dots
The [[Nvidia-Groq - Inference Disaggregation Play]] is the hardware story: specialized architectures for [[Prefill and Decode]], optimized for different workload patterns. This note is the deployment and economics story: how those specialized chips get served to enterprises at scale.
Both converge on the same conclusion. Inference is disaggregating, both architecturally (hardware specialization) and operationally (managed isolation replacing shared APIs). The winners are the platforms that combine both: optimal hardware for each workload pattern, delivered through dedicated infrastructure with elastic scaling.
[[The AI Stack - Building Blocks]] needs a new layer in the picture. Between raw infrastructure and applications, there's an inference serving layer that's becoming its own market. That's where the next wave of infrastructure value creation happens.
---
#deeptech #investing #kp
Related: [[Inference]] | [[AI Inference Infrastructure]] | [[Nvidia-Groq - Inference Disaggregation Play]] | [[Prefill and Decode]] | [[Noisy Neighbour Problem]] | [[Convenience-Control Tradeoff]] | [[multi-tenancy]] | [[Puzzle of low data center utilisation]] | [[The infrastructure layer and AI capex]] | [[Data Center Software]] | [[Data Center Master Metrics]] | [[Power Usage Effectiveness - PUE]] | [[The AI Stack - Building Blocks]] | [[Data Center MoC]]