Implementing a Semantic Cache for your LLM app with CosmosDB

An implementation with LangChain and MongoDB API

Valentina Alto
11 min read · Jul 15, 2024


Over the last months, many LLM-powered applications have converged toward the framework of AI Agents, which can be defined as specialized entities designed to address users’ queries and perform tasks. More specifically, an AI Agent presents the following features:

  • It is powered by an LLM (or SLM) which acts as the “brain” of the application. This means that Agents are capable of planning a sequence of actions to accomplish the user’s query.
  • It has access to a set of tools or plug-ins to execute actions in the surrounding ecosystem (e.g., querying data in an ERP, sending an email with a generated newsletter, posting a picture on LinkedIn…)
  • With the advent of multimodality, an AI Agent can now perceive the surrounding environment in all its elements. For example, an AI Agent could see the damage to a car after an accident, describe it aloud to the car owner, and send a claim to the insurance company.
  • It has the ability to remember its past interactions and behaviours, which are also leveraged for future improvement. This component is called memory (more specifically, short-term memory).

Let’s focus on memory.

In fact, there are three main kinds of data that we might want to store when it comes to AI Agents:

  • Conversational interactions → we want to keep a context window for our Agent to remember certain things, whether full conversations with users or specific question-and-answer pairs that might recur. In this case, the best practice is to use an in-memory database.

Note: An in-memory database (IMDB) is a type of database management system that primarily relies on main memory (RAM) for data storage, as opposed to traditional databases that store data on disk. This approach allows for much faster data retrieval and processing, making in-memory databases ideal for applications requiring high performance and low latency.

  • External knowledge base → a key component of AI Agents is the so-called non-parametric knowledge, which is the core of RAG-based applications. In this area, the last two years have seen the rise of vector search and vector databases.

Note: Vector search is based on embeddings, which are multi-dimensional numerical representations (vectors) of text, computed in such a way that their mathematical distance reflects their semantic similarity. In RAG-based applications, both the knowledge base and the user’s query are embedded, and only the chunks whose embeddings are closest to the query vector are retrieved as context (see the short sketch after this list).

There are plenty of DBs on the market to pick from: some are vector-native, such as Qdrant or Weaviate, while others offer vector-enabled features, such as Azure SQL Database.

  • Tracing/logs and conversation history (not cached) → production-ready applications, including AI Agents, need a logging mechanism and a store for those logs. In this case, relational databases can be a good fit. Plus, they can also be used to store conversations that exceed the intended context window, or that we simply don’t want to cache.
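
To make the vector search note above concrete, here is a minimal sketch of cosine similarity between embeddings. The vectors below are made up for illustration; real embeddings such as ada-002 have 1,536 dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.12, 0.85, 0.03])  # toy embedding of the user's query
chunk_vec = np.array([0.10, 0.80, 0.05])  # toy embedding of a knowledge-base chunk

print(cosine_similarity(query_vec, chunk_vec))  # close to 1.0 -> semantically similar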

Now, having multiple standalone databases is not ideal, since it can affect our AI Agent’s performance. Plus, it creates extra maintenance overhead. Last but not least, even though it’s often useful to have shared memory across Agents (two or more agents might collaborate to solve a problem), it is still important to allow each agent to maintain its own memory, which reflects its own expertise and personality.

To address all these requirements, a great option is Azure Cosmos DB. And if you are wondering whether it is enough to support your AI app, I’ll leave here a quote:

“OpenAI relies on Cosmos DB to dynamically scale their ChatGPT service — one of the fastest-growing consumer apps ever — enabling high reliability and low maintenance.” — Satya Nadella, Microsoft chairman and chief executive officer

So basically, we are talking about the DB behind the most successful AI application 😁

What is Azure Cosmos DB and how to use it with AI Agents

Azure Cosmos DB is a fully managed NoSQL, relational, and vector database. It offers single-digit millisecond response times, automatic and instant scalability, and guaranteed speed at any scale. From geo-replicated distributed caching to backup and vector indexing, it provides the infrastructure to power modern applications. Plus, with SLA-backed availability and enterprise-grade security, business continuity is assured.

CosmosDB offers multiple database APIs to start with:

  • NoSQL → it stores data in document format (JSON).
  • MongoDB → it implements the MongoDB wire protocol and is well-suited for document-oriented data with complex, nested structures.
  • PostgreSQL → it supports relational data with SQL-like queries.
  • Cassandra → it is compatible with Apache Cassandra and designed as a wide-column store.
  • Gremlin → it enables graph-based data modeling.
  • Table → it is designed for key-value data.

The NoSQL, PostgreSQL, and MongoDB APIs also offer a vector store feature with a variety of search algorithms. In this article, we are going to see an example with the MongoDB API, leveraging it both to store the knowledge base for RAG and as a semantic cache for better performance.

If you want to learn more about CosmosDB, you can do it here.

Populate your DB with embeddings

The first step is creating the CosmosDB instance on your Azure subscription. More specifically, we need to create a CosmosDB for MongoDB with vCore architecture, which supports native vector integration (for a complete guide on the differences between vCore and Request Units, you can read more here).

You can start by configuring the free tier which provides 32GiB of storage.

Once your instance is up and running, you can start populating it with data. You can either work directly from the Mongo Shell available in the Azure Portal, or interact with the DB via the pymongo library in Python (you can easily install it with pip install pymongo). We will use the latter method to upload our embeddings into the database.

To do so, we first need to create a connection to our DB through a client, using our connection string. You can find it under the Connection strings tab of your Azure Cosmos DB instance (you have to fill in the username and password that you chose while deploying the instance).

Then, you can initialize the client as follows:

from pymongo import MongoClient

CONNECTION_STRING = "your-connection-string"
client: MongoClient = MongoClient(CONNECTION_STRING)
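
As a quick, optional sanity check, you can ping the server to verify that the connection string and credentials work:

# Connectivity check: raises an exception if the server is unreachable
client.admin.command("ping")
print("Connected to Azure Cosmos DB for MongoDB")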

Great, now that we have our client, we can populate our DB with our embeddings. To do so, we will need:

  • A database name and a collection name to be populated with our embeddings:
INDEX_NAME = "vaalt-test-index"
NAMESPACE = "vaalt_test_db.vaalt_test_collection"
DB_NAME, COLLECTION_NAME = NAMESPACE.split(".")
  • An embedding model (I will use an Azure OpenAI text-embedding-ada-002 deployment):
import os

from langchain_openai import AzureOpenAIEmbeddings

os.environ["AZURE_OPENAI_API_KEY"] = "xxx"
os.environ["AZURE_OPENAI_ENDPOINT"] = "xxx"

embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-ada-002",
    openai_api_version="2023-05-15",
)
  • A knowledge base to embed and save into our DB. In our case, we will use the PDF paper “A Comprehensive Overview of Large Language Models” by Humza Naveed et al. We will leverage LangChain to load and chunk the document before computing the embeddings.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter

loader = PyPDFLoader("https://arxiv.org/pdf/2307.06435")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
  • Initialize the vector store and populate it with the embedded chunks:
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)

collection = client[DB_NAME][COLLECTION_NAME]

vectorstore = AzureCosmosDBVectorSearch.from_documents(
    docs,
    embeddings,
    collection=collection,
    index_name=INDEX_NAME,
)
  • Create the vector index on the collection:
num_lists = 100
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS
kind = CosmosDBVectorSearchType.VECTOR_IVF
m = 16
ef_construction = 64

vectorstore.create_index(
    num_lists, dimensions, similarity_algorithm, kind, m, ef_construction
)

Note that before creating the index, we initialized the following variables (you can see all the available parameters here):

  • num_lists: Number of lists or partitions used by the IVF index (affects performance and accuracy).
  • dimensions: Dimensionality of the vectors (1536 for ada-002 embeddings).
  • similarity_algorithm: The similarity metric used for vector search (COS for cosine similarity).
  • kind: Type of vector search algorithm (VECTOR_IVF for Inverted File index).
  • m: The maximum number of connections per layer, used by the HNSW index type.
  • ef_construction: The number of candidate neighbors considered during the index construction phase; it affects the performance and accuracy of the vector search. Higher values result in better index quality at the price of longer index construction time.

Once the indexing is done, you will see a message similar to the following:

{'raw': {'defaultShard': {'numIndexesBefore': 1,
'numIndexesAfter': 2,
'createdCollectionAutomatically': False,
'ok': 1}},
'ok': 1}
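
Before wiring up a full Q&A chain, you can sanity-check retrieval with a plain similarity search against the vector store (an optional test; the query is just an example):

# Retrieve the top-3 chunks most similar to a test query
results = vectorstore.similarity_search("What is BLOOM?", k=3)
for doc in results:
    print(doc.page_content[:200])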

Now let’s initialize a LangChain Q&A chain to create an intelligent agent to answer our queries in natural language.

# Retrieve the relevant snippets of the paper and generate an answer with the LLM.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import AzureChatOpenAI

# Chat model used for generation (the deployment name is an assumption:
# replace it with the name of your own Azure OpenAI chat deployment)
llm = AzureChatOpenAI(
    azure_deployment="gpt-4",
    openai_api_version="2023-05-15",
)

retriever = vectorstore.as_retriever()

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

rag_chain.invoke("What is BLOOM?")

Output:

AIMessage(content='BLOOM is a causal decoder model trained on the ROOTS corpus to open-source a large language model (LLM). The architecture of BLOOM includes differences such as ALiBi positional embedding and an additional normalization layer after the embedding layer, as suggested by the bitsandbytes library. These changes are intended to stabilize training and improve downstream performance.', response_metadata={'token_usage': {'completion_tokens': 71, 'prompt_tokens': 17123, 'total_tokens': 17194}, 'model_name': 'gpt-4', 'system_fingerprint': 'fp_811936bd4f', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}, id='run-8ccfc790-1988-4fb4-a582-0774886a479d-0', usage_metadata={'input_tokens': 17123, 'output_tokens': 71, 'total_tokens': 17194})

As you can see, the model was able to return a correct answer.

Now, let’s imagine that this is a customer-facing AI Agent receiving thousands of questions per second. We clearly have to face a latency issue, since we want to guarantee our users great performance while using our agent. However, even with the most scalable infrastructure (in terms of both database and LLM infrastructure, like Azure OpenAI PTUs), the retrieval and generation steps take an irreducible amount of time to run, especially with a huge and complex knowledge base.

So how do we deal with that? With caching! The idea is to store the most frequently asked questions and the corresponding responses in a dedicated cache, so that the agent can retrieve them more quickly when a new customer asks the same thing.

More specifically, since we are talking about LLMs, with Semantic Caching.

Creating a Semantic Cache

The revolutionary aspect of semantic caching lies in the attribute semantic. In a pre-GenAI scenario, caching worked by exact matching: a keyword or string match between the user’s query and the saved question was needed to hit the cache.

With a semantic cache, on the other hand, an embedding model captures the semantic meaning of the user’s query and compares it with the cached Q&A pairs. This means the user’s query doesn’t have to match a cached query exactly; it just has to have the same semantic meaning.
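
To make the difference concrete, here is a minimal, illustrative sketch (not the CosmosDB implementation) contrasting an exact-match lookup with a semantic one. The query vectors are assumed to come from an embedding model such as ada-002, and the 0.9 threshold mirrors the score_threshold we will set below.

import numpy as np

def exact_lookup(cache: dict, query: str):
    # Classic caching: hit only if the query string matches a saved question exactly
    return cache.get(query)

def semantic_lookup(cache: list, query_vec: np.ndarray, threshold: float = 0.9):
    # Semantic caching: hit if any saved question is close enough in embedding space
    for stored_vec, stored_answer in cache:
        sim = float(np.dot(stored_vec, query_vec) /
                    (np.linalg.norm(stored_vec) * np.linalg.norm(query_vec)))
        if sim >= threshold:
            return stored_answer
    return None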

Let’s see how to do that:

from langchain.globals import set_llm_cache
from langchain_community.cache import AzureCosmosDBSemanticCache

# Default values for these params
num_lists = 3
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS
kind = CosmosDBVectorSearchType.VECTOR_IVF
m = 16
ef_construction = 64
ef_search = 40
score_threshold = 0.9
application_name = "LANGCHAIN_CACHING_PYTHON"

set_llm_cache(
    AzureCosmosDBSemanticCache(
        cosmosdb_connection_string=CONNECTION_STRING,
        cosmosdb_client=None,
        embedding=embeddings,
        database_name=DB_NAME,
        collection_name=COLLECTION_NAME,
        num_lists=num_lists,
        similarity=similarity_algorithm,
        kind=kind,
        dimensions=dimensions,
        m=m,
        ef_construction=ef_construction,
        ef_search=ef_search,
        score_threshold=score_threshold,
        application_name=application_name,
    )
)

Similarly to the semantic search in the vector DB, for the cache we also need to define the vector search parameters. An important parameter here is score_threshold: a higher threshold requires a closer semantic match and thus lowers the likelihood of hitting the cache, while a lower threshold increases cache hits at the risk of serving an answer to a question that is only loosely related. It is important to set it according to your app’s strategy (performance vs. accuracy of responses).

Now, let’s ask a question against our DB and monitor the response time:

%%time
rag_chain.invoke("What is BLOOM?")
CPU times: total: 125 ms
Wall time: 27.5 s

AIMessage(content='BLOOM is a causal decoder model trained on the ROOTS corpus to open-source a large language model (LLM). The architecture of BLOOM includes differences such as ALiBi positional embedding and an additional normalization layer after the embedding layer, as suggested by the bitsandbytes library. These changes are intended to stabilize training and improve downstream performance.', response_metadata={'token_usage': {'completion_tokens': 71, 'prompt_tokens': 17123, 'total_tokens': 17194}, 'model_name': 'gpt-4', 'system_fingerprint': 'fp_811936bd4f', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}, id='run-e00630f1-ac2e-46e7-9e75-6c585ec58c50-0', usage_metadata={'input_tokens': 17123, 'output_tokens': 71, 'total_tokens': 17194})

Now, both the question and the response have been saved into the semantic cache, so in theory, if we ask the same question (or a similar one), we should get a quicker response. Let’s try:

%%time
rag_chain.invoke("What is BLOOM?")

CPU times: total: 15.6 ms
Wall time: 4.78 s

AIMessage(content='BLOOM is a causal decoder model trained on the ROOTS corpus to open-source a large language model (LLM). The architecture of BLOOM includes differences such as ALiBi positional embedding and an additional normalization layer after the embedding layer, as suggested by the bitsandbytes library. These changes are intended to stabilize training and improve downstream performance.', response_metadata={'token_usage': {'completion_tokens': 71, 'prompt_tokens': 17123, 'total_tokens': 17194}, 'model_name': 'gpt-4', 'system_fingerprint': 'fp_811936bd4f', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': None, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}, id='run-e00630f1-ac2e-46e7-9e75-6c585ec58c50-0', usage_metadata={'input_tokens': 17123, 'output_tokens': 71, 'total_tokens': 17194})

As you can see, the wall-clock response time dropped from 27.5 s to 4.78 s (a reduction of roughly 83%), while CPU time dropped by 87.5%.
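
You can also try a paraphrase of the original question: since the lookup is semantic, a query whose embedding similarity to the cached one exceeds the score_threshold should be served from the cache as well (an illustrative variation, not part of the original run):

%%time
# A paraphrase of the cached question: it does not match the stored string exactly,
# but its embedding should be close enough to hit the semantic cache.
rag_chain.invoke("Can you explain what the BLOOM model is?")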

Conclusions

Semantic caching is a game changer in the landscape of AI Agents, especially when it comes to high-throughput applications. Having a vector-based cache of Q&A pairs makes it easier to meet the expected performance while maintaining high accuracy in the (semantic) matching between the user’s query and the saved pairs. Obviously, in a production deployment we will need to define a “frequent Q&A” pattern to decide which pairs to save, and this can be achieved with proper monitoring.

Overall, with Cosmos DB we can improve multiple facets of our architecture thanks to its flexibility, while maintaining a single DB service as the backend.
