#foundationalmodels
# Domain Specific Noteworthy Models
- DiffDock
- EquiBind
- ESMFold
# How do generative AI companies create defensibility?
- Distribution Advantages - acquired / trust based ownership / influence over large audiences
- Network Effects - creating [[First scaler advantages]]
# Where is the value capture?
Today we have thin product layers sitting atop powerful, external models. These product layers offer a nicer UI / customer experience and act as an interface to a specialised model (e.g. healthcare) or a particular functionality (e.g. a Chrome plugin).
These are not AI companies, in the same way that DTC brands were not tech businesses. We will see a proliferation of customer uptake, but a big lag before true revenue generation follows.
# What are the computational challenges?
- The computational [complexity](https://arxiv.org/pdf/2007.05558.pdf) of state-of-the-art Artificial Intelligence (AI) systems is [doubling every ~3.4 months](https://openai.com/blog/ai-and-compute/), vastly outstripping compute supply.
- Since 2012, this metric of compute needed by AI training has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase).
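As a sanity check on those figures, a quick back-of-envelope calculation reproduces both growth factors. The ~62-month window (AlexNet in 2012 to the largest runs OpenAI measured in late 2017) and the 3.4-month doubling time are assumptions taken from OpenAI's post, not from this note:

```python
# Back-of-envelope check of OpenAI's compute-growth claim.
# Assumed inputs: ~62 months of exponential growth, 3.4-month doubling time.
months = 62

fast_doubling = 2 ** (months / 3.4)  # observed AI-compute trend
slow_doubling = 2 ** (months / 24)   # a hypothetical 2-year doubling period

print(f"3.4-month doubling: ~{fast_doubling:,.0f}x")  # roughly 300,000x
print(f"2-year doubling:    ~{slow_doubling:.1f}x")   # roughly 6-7x
```

The comparison makes OpenAI's parenthetical concrete: the same window under a Moore's-law-like 2-year doubling yields only a single-digit multiple.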
GPT-3 175B, the largest GPT-3 model introduced by OpenAI in [Brown et al. (2020)](https://arxiv.org/abs/2005.14165), used a cluster of 1,000 NVIDIA Tesla V100 GPUs for training - roughly equivalent to 355 years of training on a single device.
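The 355-GPU-years figure can be rederived as a rough sketch. The two inputs below are assumptions from outside this note: ~3.14e23 total training FLOPs (reported in Brown et al. 2020) and ~28 TFLOPS sustained mixed-precision throughput per V100:

```python
# Rough rederivation of the "355 V100-years" figure for GPT-3 175B.
# Assumed: ~3.14e23 total training FLOPs, ~28 TFLOPS sustained per V100.
total_flops = 3.14e23
v100_flops_per_sec = 28e12

seconds = total_flops / v100_flops_per_sec
years = seconds / (365 * 24 * 3600)
print(f"~{years:.0f} V100-years")  # roughly 355
```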
![[Pasted image 20230219012114.png]]
![[Pasted image 20230219011813.png]]
![[Pasted image 20230219015121.png]]
> Hardware & algorithm innovation is needed that helps us increase FLOPS/watt or FLOPS/$
Ref:
- https://www.fast.ai/ - reconfiguring the same hardware to get better results: https://www.fast.ai/posts/2018-04-30-dawnbench-fastai.html
- https://www.nytimes.com/2018/01/14/technology/artificial-intelligence-chip-start-ups.html
### New Hardware Innovators
- [Horizon Robotics](https://en.horizon.ai/products/bpu-brain-processing-unit-engine/) - Autonomous Driving Chips, Brain Processing Units
- [Xilinx x DeePhi](https://www.xilinx.com/publications/events/developer-forum/2018-frankfurt/xilinx-machine-learning-strategies-with-deephi-tech.pdf)
- https://mythic.ai/
- https://www.cerebras.net/
- https://www.graphcore.ai/ - Intelligence Processing Units (IPUs)
Cost will eventually **limit the parallelism** side of the trend and **physics will limit the chip efficiency** side. OpenAI believes the largest training runs today **employ hardware that cost in the single digit millions of dollars to purchase** (although the amortized cost is much lower).
But the majority of neural net compute today is still spent on inference (deployment), not training, meaning companies can repurpose or afford to purchase much larger fleets of chips for training.
Therefore, if sufficient economic incentive exists, we could see even more massively parallel training runs, and thus the continuation of this trend for several more years.
The world’s total hardware budget is [1 trillion dollars](https://www.statista.com/statistics/422802/hardware-spending-forecast-worldwide/) a year, so absolute limits remain far away. Overall, given the data above, the precedent for exponential trends in computing, work on ML specific hardware, and the economic incentives at play, we think it’d be a mistake to be confident this trend won’t continue in the short term.
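OpenAI's aside about amortized cost can be made concrete with a toy calculation. All numbers below are illustrative assumptions (a single-digit-millions cluster price matching OpenAI's estimate, a ~3-year hardware lifetime, a month-long run), not figures from the post:

```python
# Toy amortization: cluster purchase price vs. the cost attributable
# to one training run. All numbers are illustrative assumptions.
purchase_price = 5_000_000   # single-digit millions, per OpenAI's estimate
useful_life_days = 3 * 365   # assume a ~3-year hardware lifetime
run_days = 30                # assume a month-long training run

amortized_cost = purchase_price * run_days / useful_life_days
print(f"~${amortized_cost:,.0f} attributable to the run")  # ~$137,000
```

This is why the amortized cost of a run is far below the headline hardware purchase price: the same fleet serves many runs (and inference) over its lifetime.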
## Eras
Looking at the graph we can roughly see four distinct eras:
- Before 2012: It was uncommon to use GPUs for ML, making any of the results in the graph difficult to achieve.
- 2012 to 2014: Infrastructure to train on many GPUs was uncommon, so most results used 1-8 GPUs rated at 1-2 TFLOPS for a total of 0.001-0.1 pfs-days.
- 2014 to 2016: Large-scale results used 10-100 GPUs rated at 5-10 TFLOPS, resulting in 0.1-10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
- 2016 to 2017: Approaches that allow greater algorithmic parallelism such as [huge batch sizes](https://arxiv.org/abs/1711.04325), [architecture search](https://arxiv.org/abs/1611.01578), and [expert iteration](https://arxiv.org/pdf/1705.08439.pdf), along with specialized hardware such as TPU’s and faster interconnects, have greatly increased these limits, at least for some applications.
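The pfs-day (petaflop/s-day) unit used in the era estimates above is just sustained throughput integrated over time: 1 pfs-day is 1e15 FLOP/s held for one day. A small converter makes the era figures easy to reproduce; the specific GPU counts, TFLOPS ratings, and run lengths below are illustrative picks from within the ranges quoted above:

```python
# Convert (GPU count, per-GPU TFLOPS, training days) into petaflop/s-days.
# 1 pfs-day = 1e15 FLOP/s sustained for one day.
def pfs_days(num_gpus: int, tflops_per_gpu: float, days: float) -> float:
    sustained_pflops = num_gpus * tflops_per_gpu * 1e12 / 1e15
    return sustained_pflops * days

# 2012-2014 era: e.g. 8 GPUs at 1.5 TFLOPS for 5 days
print(pfs_days(8, 1.5, 5))   # 0.06  -> inside the 0.001-0.1 range
# 2014-2016 era: e.g. 64 GPUs at 7 TFLOPS for 10 days
print(pfs_days(64, 7, 10))   # 4.48  -> inside the 0.1-10 range
```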
![[Pasted image 20230219003911.png]]
![[Pasted image 20230219004124.png]]
# How much does training these models cost today?
![[Pasted image 20230219004743.png]]