TPU is Google's Tensor Processing Unit, an [[ASIC]] designed specifically for machine learning workloads.
Google built TPUs because its internal workloads ran at a scale where custom silicon could meaningfully cut costs. The first generation handled [[Inference]] only; later generations added training. TPU v5 now powers most of Google's AI infrastructure.
The key innovation: the entire chip is built around matrix multiplication and tensor operations (the core compute unit is a systolic-array matrix multiplier), with none of the legacy graphics baggage GPUs carry. This makes TPUs very efficient for standard deep learning operations but inflexible for anything outside that domain.
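A minimal JAX sketch of the workload class this design targets (shapes and names are illustrative, not from the note): a jitted dense layer that XLA compiles to the TPU's matrix units when a TPU backend is attached, and that falls back to CPU/GPU otherwise.

```python
import jax
import jax.numpy as jnp

@jax.jit
def dense_layer(x, w, b):
    # One fused matmul + bias + activation: the core pattern TPUs are optimized for.
    return jax.nn.relu(x @ w + b)

# Illustrative shapes only.
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(k1, (1024, 512))    # batch of activations
w = jax.random.normal(k2, (512, 2048))    # weight matrix
b = jnp.zeros((2048,))

y = dense_layer(x, w, b)                  # dispatched to a TPU if jax.devices() finds one
print(y.shape)                            # (1024, 2048)
```

The point of the sketch: nothing in the model code is TPU-specific; XLA decides how the matmul maps onto the hardware, which is why workloads that fit this dense-tensor mold run so well on TPUs and anything irregular does not.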
TPUs survive only because Google has massive captive internal demand: the $300M+ development cost gets amortized across billions of Search queries, YouTube recommendations, and Gemini requests. Most companies can't justify that investment.
The broader point: TPUs prove custom silicon can work for AI, but only at hyperscale. Nvidia's ecosystem advantages (CUDA, developer tools, third-party software) make it nearly impossible for others to compete without comparable captive demand. This is why only Google's TPU, AWS's Trainium/Inferentia, and potentially Meta's chips are likely to survive long-term.
Even with TPUs, Google still buys large numbers of Nvidia GPUs. TPUs handle Google's standard workloads; Nvidia GPUs handle research, non-standard models, and anything requiring flexibility.
---
#deeptech #inference
Related: [[ASIC]] | [[Inference]] | [[HBM]] | [[Nvidia-Groq - Inference Disaggregation Play]]