# Multi-Tenant Model Serving
Parent: [[Mixture of Experts & Adapter Architectures]]
Serving many distinct customer or workload variants from a single shared base model, with per-tenant specialisation layered on top. This is the architectural unlock for commercial AI infrastructure: without it, every tenant needs a dedicated GPU; with it, one GPU can serve hundreds or thousands of tenants at bounded incremental cost.
The pattern is: one large base model pinned in GPU memory, plus a pool of small tenant-specific adapters (typically LoRA) loaded on demand. At request time, the server routes the request to the base model, looks up the tenant's adapter, applies it to the forward pass, and returns. When the tenant goes idle, the adapter is evicted from the active cache; when they come back, it is reloaded.
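The load-on-demand/evict-on-idle lifecycle above is essentially an LRU cache keyed by tenant. A minimal sketch, assuming a caller-supplied `load_fn` standing in for whatever actually fetches adapter weights from disk or object storage (both names are hypothetical, not a real serving API):

```python
from collections import OrderedDict

class AdapterCache:
    """LRU cache of tenant adapters over a shared base model (sketch)."""

    def __init__(self, load_fn, capacity=32):
        self.load_fn = load_fn        # hypothetical: loads adapter weights for a tenant
        self.capacity = capacity      # max adapters resident at once
        self._cache = OrderedDict()   # tenant_id -> adapter, ordered by recency
        self.hits = 0
        self.misses = 0

    def get(self, tenant_id):
        if tenant_id in self._cache:
            self._cache.move_to_end(tenant_id)  # mark as most recently used
            self.hits += 1
            return self._cache[tenant_id]
        self.misses += 1                        # cold load: fetch from storage
        adapter = self.load_fn(tenant_id)
        self._cache[tenant_id] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)     # evict least recently used tenant
        return adapter
```

Real servers layer more on top (pinning, async prefetch, size-aware eviction), but the hit/miss accounting here is exactly the quantity the economics section below turns on.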
The critical engineering questions are about switching cost and batching. Can you swap adapters between requests without reloading the base? (Yes, if the framework supports it.) Can you batch requests from different tenants using different adapters in the same forward pass? (Harder — this is what S-LoRA and similar systems are built to enable.) What is the tail latency when an adapter has to cold-load from disk?
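Why cross-tenant batching is possible at all comes down to the LoRA math: every request shares the expensive `x @ W` against the frozen base, and only the cheap low-rank delta `x @ A @ B` differs per tenant. A sketch of that structure with NumPy (systems like S-LoRA fuse this into custom paged kernels; the per-request loop here is purely illustrative):

```python
import numpy as np

def batched_lora_forward(x, W, adapters, adapter_ids):
    """One forward pass over a batch whose requests use different adapters.

    x: (batch, d_in) inputs; W: (d_in, d_out) shared frozen base weight.
    adapters: dict of adapter_id -> (A, B), A: (d_in, r), B: (r, d_out).
    adapter_ids: one adapter_id per request in the batch.
    """
    out = x @ W                       # shared computation, amortised over all tenants
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        out[i] += x[i] @ A @ B        # per-tenant low-rank correction
    return out
```

The base matmul dominates the FLOPs, which is why mixing tenants in one batch costs little more than serving a single tenant.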
Unit economics live and die on cache hit rate. If hot adapters stay resident and cold adapters are rare, per-tenant serving cost approaches the base model's marginal inference cost. If cold-loads dominate, you are effectively running separate models and the multi-tenant story collapses.
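The hit-rate dependence is just an expectation over warm and cold paths. A back-of-envelope helper, with purely illustrative numbers (0.5 ms warm switch, 200 ms cold load are assumptions, not measurements):

```python
def effective_latency_ms(hit_rate, warm_ms, cold_load_ms):
    """Expected per-request adapter overhead given a cache hit rate.

    warm_ms: adapter-switch cost when the adapter is already resident.
    cold_load_ms: extra cost to fetch the adapter from disk/network on a miss.
    """
    return hit_rate * warm_ms + (1.0 - hit_rate) * (warm_ms + cold_load_ms)

# At 90% hit rate, a 200 ms cold load still adds ~20 ms on average:
overhead = effective_latency_ms(0.90, warm_ms=0.5, cold_load_ms=200.0)
```

The same expectation applies to cost per request: as `hit_rate` approaches 1, overhead approaches the warm-switch cost and multi-tenancy pays off; as it falls, the cold term dominates and the economics collapse as described above.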
## Related
- [[LoRA (Low-Rank Adaptation)]]
- [[Parameter-Efficient Fine-Tuning (PEFT)]]
- [[Mixture of Experts (MoE)]]
---
Tags: #ai #serving #infra #kp