# Mixture of Experts (MoE)
Parent: [[Mixture of Experts & Adapter Architectures]]
A transformer architecture where each feed-forward block is replaced by a set of N independent "expert" sub-networks plus a router that decides which experts see each token. Only k of the N experts fire per token, typically k=1 or k=2. The other experts are inert for that token.
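The dispatch step can be sketched in a few lines of pure Python. Everything here is a toy (scalar "experts", a linear router weight per expert); it only illustrates the mechanic that all N experts are scored but just k are executed:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_weights, k=2):
    """Route one token: score all N experts, but run only the top-k."""
    logits = [w * token for w in router_weights]   # toy linear router
    probs = softmax(logits)
    topk = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalise the gate weights over the chosen experts only.
    denom = sum(probs[i] for i in topk)
    return sum((probs[i] / denom) * experts[i](token) for i in topk)

# Four toy experts; only two ever execute for a given token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out = moe_forward(0.5, experts, router_weights=[0.1, -0.2, 0.3, 0.05], k=2)
```

The renormalisation over the selected experts mirrors common top-k gating, where the discarded experts' probability mass is redistributed to the winners.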
The economic logic: you can have a very large total parameter count (good for capability) while keeping the active parameter count per token small (good for inference cost). Mixtral 8x7B has roughly 47B total parameters but only ~13B active per token. The model behaves like a 13B for compute and a 47B for knowledge.
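The arithmetic behind those numbers, with illustrative per-expert and shared parameter counts (the split between expert and shared parameters below is a rough assumption, not Mixtral's published breakdown):

```python
# Back-of-envelope total vs active parameters for a Mixtral-style MoE.
n_experts = 8
k = 2                    # experts active per token
expert_params = 5.6e9    # params per expert FFN (assumed for illustration)
shared_params = 2.0e9    # attention, embeddings, norms (assumed)

total = shared_params + n_experts * expert_params   # counted once each
active = shared_params + k * expert_params          # what a token touches
```

Compute cost tracks `active`; memory and knowledge capacity track `total`.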
The router is the interesting part. It is typically a small linear layer that takes the token's hidden state and produces N logits; top-k gating picks the winners. The routing is learned end-to-end alongside everything else, with auxiliary load-balancing losses to prevent routing collapse — the failure mode where the router funnels everything to one or two favourite experts and the rest atrophy.
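One common form of that auxiliary loss is the Switch-Transformer-style balance term: N · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for expert i. A sketch in pure Python, assuming top-1 routing for simplicity:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def load_balance_loss(router_logits, n_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.
    f_i = fraction of tokens whose top-1 choice is expert i,
    P_i = mean router probability assigned to expert i.
    Equals 1.0 when routing is perfectly balanced; grows as it skews."""
    n_tokens = len(router_logits)
    f = [0.0] * n_experts
    P = [0.0] * n_experts
    for logits in router_logits:
        p = softmax(logits)
        f[max(range(n_experts), key=lambda i: p[i])] += 1 / n_tokens
        for i in range(n_experts):
            P[i] += p[i] / n_tokens
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Balanced batch: each token prefers a different expert.
balanced = load_balance_loss(
    [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]], 4)
# Collapsed batch: every token prefers expert 0.
collapsed = load_balance_loss([[5, 0, 0, 0]] * 4, 4)
```

Because fᵢ involves a hard argmax, the gradient flows through the Pᵢ factor; in training this term is added to the task loss with a small coefficient.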
MoE pays for its capacity advantage in two ways. First, memory: even inactive experts must sit in memory, so VRAM requirements scale with total parameters, not active parameters. Second, routing complexity: at batch-size-1 interactive inference, different tokens want different experts and you cannot amortise their weights across a batch, which hurts latency.
Distinct from adapter-based specialisation: MoE experts are trained jointly with the rest of the model, whereas LoRA adapters are trained against a frozen base.
## Related
- [[LoRA (Low-Rank Adaptation)]]
- [[Parameter-Efficient Fine-Tuning (PEFT)]]
- [[Multi-Tenant Model Serving]]
---
Tags: #ai #moe #kp