
### Understanding the AI agents landscape
The agent software ecosystem has developed significantly in the past few months, with progress in memory, tool usage, secure execution, and deployment. The AI agents stack in late 2024 can be organized into three key layers: agent hosting/serving, agent frameworks, and LLM models & storage.

### From LLMs to LLM agents
In 2022 and 2023 we saw the rise of LLM frameworks and SDKs such as [LangChain](https://github.com/langchain-ai/langchain) (released in Oct 2022) and [LlamaIndex](https://github.com/run-llama/llama_index) (released in Nov 2022). Simultaneously we saw the establishment of several “standard” platforms for consuming LLMs via APIs as well as self-deploying LLM inference ([vLLM](https://github.com/vllm-project/vllm) and [Ollama](https://github.com/ollama/ollama)).
In 2024, we’ve seen a dramatic shift in interest towards AI “agents”, and more generally, compound systems. Despite having existed as a term in AI for decades (notably in [reinforcement learning](https://en.wikipedia.org/wiki/Reinforcement_learning)), “agents” has become a loosely defined term in the post-ChatGPT era, often referring to LLMs that are tasked with outputting actions (tool calls) and that run in an autonomous setting. The combination of tool use, autonomous execution, and memory required to go from LLM → agent has driven the development of a new **_agents stack_**.
### What makes the agent stack unique?
Agents are a significantly harder engineering challenge compared to basic LLM chatbots because they require **state management** (retaining the message/event history, storing long-term memories, executing multiple LLM calls in an agentic loop) and **tool execution** (safely executing an action output by an LLM and returning the result).
As a result, the AI agents stack looks very different from the standard LLM stack. Let’s break down today’s AI agents stack, starting from the bottom at the model serving layer:
### Model serving

At the core of AI agents is the LLM. To use the LLM, the model needs to be served via an inference engine, most often run behind a paid API service.
[OpenAI](https://openai.com/) and [Anthropic](https://anthropic.com/) lead among the closed API-based model inference providers with private frontier models. [Together.AI](http://together.ai/), [Fireworks](https://fireworks.ai/), and [Groq](https://groq.com/) are popular options that serve open-weights models (e.g. Llama 3) behind paid APIs. Among local model inference providers, [vLLM](https://github.com/vllm-project/vllm) leads the pack for production-grade GPU-based serving workloads, with [SGLang](https://github.com/sgl-project/sglang) an up-and-coming project targeting a similar developer audience. Among hobbyists (“AI enthusiasts”), [Ollama](https://github.com/ollama/ollama) and [LM Studio](https://lmstudio.ai/) are two popular options for running models on your own computer (e.g. M-series Apple MacBooks).
### Storage

Storage is a fundamental building block for agents, which are _stateful_: agents are defined by persisted state like their conversation history, memories, and the external data sources they use for RAG. Vector databases like [Chroma](https://github.com/chroma-core/chroma), [Weaviate](https://github.com/weaviate/weaviate), [Pinecone](https://www.pinecone.io/), [Qdrant](https://github.com/qdrant/qdrant), and [Milvus](https://github.com/milvus-io/milvus) are popular for storing the “external memory” of agents, allowing agents to leverage data sources and conversational histories far too large to fit into the context window. [Postgres](https://www.postgresql.org/), a traditional DB that has been around since the 1980s, now also supports vector search via the [pgvector](https://github.com/pgvector/pgvector) extension. Postgres-based companies like [Neon](https://neon.tech/) (serverless Postgres) and [Supabase](https://supabase.com/) also offer embedding-based search and storage for agents.
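The retrieval pattern these databases implement can be sketched in a few lines of pure Python: embed each memory, then rank stored rows by similarity to a query embedding. This toy store (all names hypothetical) leaves out the approximate nearest-neighbor indexing that makes real vector databases fast:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class ToyVectorStore:
    """Minimal stand-in for a vector DB: stores (id, embedding, text) rows
    and returns the texts whose embeddings are most similar to a query."""

    def __init__(self):
        self.rows = []  # list of (doc_id, embedding, text)

    def add(self, doc_id, embedding, text):
        self.rows.append((doc_id, embedding, text))

    def query(self, embedding, n_results=1):
        ranked = sorted(
            self.rows,
            key=lambda r: cosine_similarity(r[1], embedding),
            reverse=True,
        )
        return [(r[0], r[2]) for r in ranked[:n_results]]

# Toy embeddings; a real agent would call an embedding model instead.
store = ToyVectorStore()
store.add("m1", [0.9, 0.1, 0.0], "User prefers Python over Java")
store.add("m2", [0.0, 0.8, 0.2], "User lives in Berlin")
print(store.query([1.0, 0.0, 0.0], n_results=1))  # most similar memory first
```

A real deployment replaces the linear scan with an index (HNSW, IVF) and the hand-written vectors with model-generated embeddings, but the interface — add rows, query by embedding — is essentially this.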
### Tools & libraries

One of the primary differences between standard AI chatbots and AI agents is the ability of an agent to call “tools” (or “functions”). In most cases the mechanism for this action is the LLM generating structured output (e.g. a JSON object) that specifies a function to call and arguments to provide. A common point of confusion with agent tool execution is that the tool execution is _not_ done by the LLM provider itself - the LLM only chooses what tool to call, and what arguments to provide. An agent service that supports arbitrary tools or arbitrary arguments into tools must use sandboxes (e.g. [Modal](https://modal.com/), [E2B](https://github.com/e2b-dev/E2B)) to ensure secure execution.
Agents all call tools via the [JSON schema defined by OpenAI](https://platform.openai.com/docs/guides/function-calling) - this means that agents and tools can actually be compatible across different frameworks. [Letta](https://github.com/letta-ai/letta) agents can call [LangChain](https://python.langchain.com/docs/how_to/#tools), [CrewAI](https://github.com/crewAIInc/crewAI-tools), and [Composio](https://composio.dev/tools) tools, since they are all defined by the same schema. Because of this, there is a growing ecosystem of providers for common tools. [Composio](https://composio.dev/) is a popular general-purpose tool library that also manages authorization, [Browserbase](https://www.browserbase.com/) is an example of a specialized tool for web browsing, and [Exa](https://exa.ai/) provides a specialized tool for searching the web. As more agents are built, we expect the tool ecosystem to grow and to provide new functionality like authentication and access control for agents.
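To make the division of labor concrete, here is a hedged sketch: a tool described in the OpenAI-style function-calling schema, a simulated tool call as the LLM would emit it, and execution performed by the agent service. The `get_weather` tool and the simulated model output are made up for the example:

```python
import json

# A tool described in the OpenAI function-calling JSON schema format.
get_weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Stub implementation; a real tool might call a weather API.
    return f"Sunny in {city}"

TOOL_REGISTRY = {"get_weather": get_weather}

# Simulated LLM output: the model only *chooses* the tool and its arguments...
llm_tool_call = {"name": "get_weather", "arguments": json.dumps({"city": "Berlin"})}

# ...and the agent service is what actually executes the call.
def execute_tool_call(call):
    fn = TOOL_REGISTRY[call["name"]]
    args = json.loads(call["arguments"])
    return fn(**args)

print(execute_tool_call(llm_tool_call))  # Sunny in Berlin
```

In production, `execute_tool_call` is exactly the step that gets wrapped in a sandbox, since the arguments originate from the model rather than from trusted code.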
### Agent frameworks

Agent frameworks orchestrate LLM calls and manage agent state. Different frameworks will have different designs for:
**Management of the agent’s state**: Most frameworks have introduced some notion of “serialization” of state, which allows agents to be loaded back into the same script at a later time by saving the serialized state (e.g. JSON, bytes) to a file — this includes state like the conversation history, agent memories, and stage of execution. In [Letta](https://github.com/letta-ai/letta), where all state is backed by a database (e.g. a messages table, agent state table, memory block table), there is no notion of “serialization”, since agent state is always persisted. This makes it easy to query agent state (for example, to look up past messages by date). How state is represented and managed determines both how the agents application will scale with longer conversational histories or larger numbers of agents, and how flexibly state can be accessed or modified over time.
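The serialization approach can be sketched roughly as follows: capture the agent's state in a plain data structure, dump it to JSON, and reload it later. The field names here are hypothetical, not any particular framework's schema:

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field

@dataclass
class AgentState:
    """Hypothetical agent state: the pieces a framework typically persists."""
    messages: list = field(default_factory=list)   # conversation history
    memories: dict = field(default_factory=dict)   # long-term memories
    step: int = 0                                  # stage of execution

def save_state(state: AgentState, path: str) -> None:
    # Serialize the whole agent to a JSON file...
    with open(path, "w") as f:
        json.dump(asdict(state), f)

def load_state(path: str) -> AgentState:
    # ...and load it back into a later script run.
    with open(path) as f:
        return AgentState(**json.load(f))

state = AgentState(messages=[{"role": "user", "content": "hi"}], step=3)
path = os.path.join(tempfile.gettempdir(), "agent_state.json")
save_state(state, path)
restored = load_state(path)
print(restored.step)  # 3
```

The database-backed alternative skips the save/load round-trip entirely: every message or memory write lands in a table as it happens, so there is never an unsaved in-memory copy to serialize.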
**Structure of the agent’s context window**: Each time the LLM is called, the framework will “compile” the agent’s state into the context window. Different frameworks will place data into the context window (e.g. the instructions, message buffer, etc.) in different ways, which can alter performance. We recommend choosing a framework that makes the context window transparent, since this ultimately is how you can control the behavior of your agents.
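A minimal sketch of such a “compile” step, using a hypothetical layout that places the instructions and memories in the system message and keeps only the most recent messages from the buffer:

```python
def compile_context(system_prompt, memories, message_buffer, max_messages=10):
    """Hypothetical 'compile' step: turn persisted agent state into the
    message list actually sent to the LLM on each call."""
    memory_block = "\n".join(f"- {m}" for m in memories)
    system_message = {
        "role": "system",
        "content": f"{system_prompt}\n\nMemories:\n{memory_block}",
    }
    # Only the most recent messages fit in the window.
    return [system_message] + message_buffer[-max_messages:]

context = compile_context(
    "You are a helpful assistant.",
    ["User's name is Ada"],
    [{"role": "user", "content": "hi"}],
)
print(context[0]["role"])  # system
```

Frameworks differ in exactly this function: what goes into the system message, how much of the buffer survives, and whether you can inspect or override the result.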
**Cross-agent communication (i.e. multi-agent)**: [LlamaIndex](https://github.com/run-llama/llama_index) has agents communicate via message queues, while [CrewAI](https://github.com/crewAIInc/crewAI) and [AutoGen](https://github.com/microsoft/autogen) have explicit abstractions for multi-agent. [Letta](https://github.com/letta-ai/letta) and [LangGraph](https://github.com/langchain-ai/langgraph) both support agents directly calling each other, which allows for both centralized (via a supervisor agent) and distributed communication across agents. Most frameworks now support both multi-agent and single-agent setups, since a well-designed single-agent system should make cross-agent collaboration easy to implement.
**Approaches to memory**: A fundamental limitation of LLMs is their limited context window, which necessitates techniques to manage memory over time. Memory management is built into some frameworks, while others expect developers to manage memory themselves. CrewAI and AutoGen rely solely on RAG-based memory, while [phidata](https://github.com/phidatahq/phidata) and Letta use additional techniques like self-editing memory (from [MemGPT](https://research.memgpt.ai/)) and recursive summarization. Letta agents automatically come with a set of memory management tools that allow agents to search previous messages by text or date, write memories, and edit their own context window (you can read more [here](https://docs.letta.com/letta_memgpt)).
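Recursive summarization can be sketched as follows: when the message buffer outgrows the context budget, everything except the most recent messages (including any earlier summary, which is why the process is recursive) is folded into a new summary message. The summarizer here is a stub standing in for an LLM call:

```python
def summarize(messages):
    # Stub: a real implementation would ask an LLM to summarize these messages.
    return f"[summary of {len(messages)} earlier messages]"

def enforce_context_limit(messages, max_messages=4, keep_recent=2):
    """If the buffer exceeds max_messages, fold everything but the most
    recent messages into a single summary message. When called repeatedly,
    an earlier summary gets re-summarized along with newer messages."""
    if len(messages) <= max_messages:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(older)}
    return [summary] + recent

buffer = [{"role": "user", "content": f"msg {i}"} for i in range(6)]
buffer = enforce_context_limit(buffer)
print(len(buffer))  # 3: one summary message plus the two most recent
```

Self-editing memory goes a step further: rather than the framework deciding what to compress, the agent itself is given tools to rewrite designated sections of its own context window.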
**Support for open models**: Model providers actually do a lot of behind-the-scenes tricks to get LLMs to generate text in the right format (e.g. for tool calling) — for example, re-sampling the LLM outputs when they don’t generate proper tool arguments, or adding hints into the prompt (e.g. “pretty please output JSON”). Supporting open models requires the framework to handle these challenges, so some limit support to major model providers.
When building agents today, the right choice of framework depends on your application, such as whether you are building a conversational agent or workflow, whether you want to run agents in a notebook or as a service, and your requirements for open weights model support.
We expect major differentiators to arise between frameworks in their deployment workflows, where design choices for state/memory management and tool execution become more significant.
### Agent hosting and agent serving

Most agent frameworks today are designed for agents that don’t exist outside of the Python script or Jupyter notebook they were written in. We believe that the future of agents is to treat agents as a _service_ that is _deployed_ to on-prem or cloud infrastructure, accessible via REST APIs. In the same way that OpenAI’s `ChatCompletion` API became the industry standard for interacting with LLM services, we expect that there will eventually be a winner for the Agents API. But there isn’t one… yet.
Deploying **agents** as a service is much trickier than deploying **LLMs** as a service, due to the issues of state management and secure tool execution. A tool's required dependencies and environment need to be explicitly stored in a DB, since the service has to re-create the environment to run it (this is not an issue when your tools and agents are running in the same script). Applications may need to run millions of agents, each accumulating a growing conversational history. When moving from prototyping to production, agent state inevitably must go through a data normalization process, and agent interactions must be defined by REST APIs. Today, this process is usually done by developers writing their own FastAPI and database code, but we expect this functionality to become more embedded into frameworks as agents mature.
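As a rough sketch of what that normalization step looks like, agent messages move out of in-process objects into database rows, which also makes them directly queryable (e.g. by date). SQLite stands in here for a production database like Postgres, and the schema is hypothetical:

```python
import sqlite3

# In-memory DB for the sketch; a real service would use Postgres or similar.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE messages (
           agent_id TEXT,
           role     TEXT,
           content  TEXT,
           created  TEXT   -- ISO-8601 timestamp
       )"""
)

def append_message(agent_id, role, content, created):
    # Every message write lands in a table, not in a Python object.
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?, ?)",
        (agent_id, role, content, created),
    )

def messages_since(agent_id, since):
    # Normalized state is directly queryable, e.g. past messages by date.
    cur = conn.execute(
        "SELECT role, content FROM messages"
        " WHERE agent_id = ? AND created >= ? ORDER BY created",
        (agent_id, since),
    )
    return cur.fetchall()

append_message("agent-1", "user", "hello", "2024-11-01T10:00:00")
append_message("agent-1", "assistant", "hi!", "2024-11-02T10:00:00")
print(messages_since("agent-1", "2024-11-02T00:00:00"))  # [('assistant', 'hi!')]
```

A REST layer on top of functions like these (one route per agent interaction) is essentially the hand-rolled FastAPI-plus-database code that developers write today.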