## What really sets it apart?
- **Reinforcement Learning (RL)-driven training**: Unlike previous models, **DeepSeek-R1** leans heavily on RL, but not on human-curated examples. Instead, the model works through tasks whose answers can be checked automatically, solves them step by step (e.g., solving an equation), and is rewarded on the correctness of the answer _and_ the coherence of the thought process. The result is a model that excels at **reasoning tasks** while remaining comparable on general-knowledge or intuitive ones.
- **Multi-token prediction**: Predicting several tokens per step instead of one at a time can roughly **double inference speed**, pushing efficiency to new heights.
- **Mixture of Experts (MoE)**: Instead of running one monolithic network, the model routes each token to a small subset of specialized experts, slashing per-token compute. That efficiency is what brings inference within reach of **consumer-grade GPUs**, democratizing AI access.
- **8-bit over 32-bit precision**: Storing and computing with **8-bit floating point numbers** delivers **massive memory savings** with minimal loss of quality.
- **Key-value cache compression**: Compressing the attention key-value cache achieves an impressive **~93% compression ratio**, freeing up VRAM and improving scalability.
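To make the MoE point concrete, here is a minimal sketch of top-k expert routing. All names and sizes (`n_experts`, `top_k`, `d_model`) are illustrative assumptions, not DeepSeek's actual architecture or code:

```python
# Hypothetical top-k routing in a Mixture of Experts layer (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16

x = rng.standard_normal(d_model)                     # one token's hidden state
router = rng.standard_normal((d_model, n_experts))   # router (gating) weights

logits = x @ router                      # score every expert for this token
top = np.argsort(logits)[-top_k:]        # keep only the k highest-scoring experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen

# Only top_k of n_experts actually run, so per-token compute
# scales with top_k, not with the total parameter count.
experts = rng.standard_normal((n_experts, d_model, d_model))
y = sum(w * (x @ experts[i]) for w, i in zip(weights, top))
print(len(top), y.shape)
```

The design point is that capacity (total experts) and per-token cost (active experts) are decoupled, which is how a very large model can stay cheap to run.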
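The memory claims above reduce to simple arithmetic. A rough sketch, using DeepSeek-V3's published 671B parameter count and a hypothetical KV-cache size (the 400 GB figure is made up for illustration):

```python
# Back-of-the-envelope memory math for 8-bit weights and KV-cache compression.
params = 671e9                   # DeepSeek-V3's published parameter count
fp32_gb = params * 4 / 1e9       # 4 bytes per 32-bit weight
fp8_gb = params * 1 / 1e9        # 1 byte per 8-bit weight
print(fp32_gb, fp8_gb)           # 8-bit storage is 4x smaller

kv_cache_gb = 400                        # hypothetical uncompressed KV cache
compressed_gb = kv_cache_gb * (1 - 0.93) # applying the ~93% compression figure
print(compressed_gb)
```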
But the biggest shift is philosophical: DeepSeek's RL-driven approach changes how models are trained. We're no longer constrained by the supply of human-curated data. For tasks whose results can be validated automatically (e.g., math problems), training is bound only by **compute power**. This breaks the **data bottleneck narrative**, opening doors for unprecedented scalability.
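The "validated automatically" part is the key. A minimal sketch of such a verifiable reward, where the task format and reward values are assumptions for illustration, not DeepSeek's actual training spec:

```python
# Sketch of a "verifiable reward": for math-style tasks, correctness can be
# checked programmatically, so no human-labeled preference data is needed.
def reward(task: dict, model_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the known solution."""
    return 1.0 if model_answer.strip() == task["solution"] else 0.0

task = {"question": "Solve 3x + 5 = 20 for x.", "solution": "5"}
print(reward(task, "5"))   # correct answer -> 1.0
print(reward(task, "7"))   # wrong answer  -> 0.0
```

Because the checker is just code, generating more training signal costs only compute, which is exactly the scalability argument above.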
Now here's where things get interesting: as we innovate for efficiency, [[Jevons Paradox]] kicks in. Greater AI efficiency doesn't reduce resource consumption; it accelerates it. The easier it becomes to train smarter, more efficient models, the more demand for compute grows.
**The implications are massive.** Yet NVIDIA's valuation dropped on the news, a paradoxical reaction to a compute-bound future. The market doesn't seem to get it.