How To Improve Language Models through Retrieval-Augmented Generation (RAG)

Now you can get factual, relevant, and up-to-date information

Language Models are great, but have you ever tried asking them about a complex topic, such as your company or an event that happened last week? You were likely disappointed if you did, but this blog explains why that happens and how to solve it!

Challenges of Large Language Models (LLMs)

When OpenAI released ChatGPT in 2022 and LLMs became mainstream, funny and not-so-funny examples of completely wrong answers to simple questions appeared everywhere. This graphic shows a typical LLM Hallucination (Source: Zhang et al).

The response quality of language models has improved significantly since 2022, with further enhancements just around the corner (LLM performance is usually measured through various benchmarks). 

Let's look at some of the biggest challenges: 1. Hallucinations, 2. Outdated Information, 3. Lack of knowledge about your company, and 4. Trust.

1. Hallucinations 

Many LLMs still hallucinate today and sound very confident about any nonsense they produce. That is particularly problematic for LLM usage in highly complex and specialized domains. Imagine a medical professional asking an LLM for symptoms of a rare disease and getting non-factual or outdated information as a response.

2. Outdated information due to training cutoff dates

Another challenge of LLMs is their training cutoff date (i.e. the most recent data that the model was trained with). An LLM with a cutoff date of January 1, 2024 can’t factor in any information that became available after this date. To come back to our example, that means the model wouldn’t be able to incorporate the results of a breakthrough study about the rare disease that was published on February 1, 2024, even though the model itself was probably released well after the study was published.

3. Lack of Knowledge about a particular company or domain

Using LLMs to solve business problems reveals another challenge: LLMs don’t know anything about your company that is not publicly available, and what they do know is usually only a small portion of what's out there. For example, let's consider a medical professional who is asking how to treat a rare disease. The LLM might respond with a treatment method that their hospital doesn’t offer, without providing alternative treatment methods or a list of nearby hospitals that do offer the optimal treatment method.

4. Lack of trust due to missing sources

These challenges have one thing in common: they make it incredibly hard to trust the output of LLMs since they usually don’t provide any sources for their response. So how could an LLM help you and your business tackle questions that take your situation and needs into account?

Of course, there are other challenges, such as difficult evaluation and ethical considerations, but they are out of the scope of this blog.

How to solve these challenges?

Hallucinations, outdated information, missing business/domain knowledge, and trustworthiness make using an LLM in a professional context difficult. Fortunately, there is a solution to these challenges: Retrieval-Augmented Generation (RAG for short), proposed by Patrick Lewis et al.
RAG combines information retrieval techniques with language models. It draws on information from user-provided sources, typically stored in a vector database, and the language model formulates its response using parts of this information.

Figure 1: Simplified RAG application

Components of a RAG application

The typical components of a RAG application are a language model, a knowledge base, and a retriever.

  • The knowledge base contains relevant documents (or other media formats like images and videos) for solving our problems. This could be all of Wikipedia, company documents, or scientific papers. The information is usually chunked and vectorized/embedded to project it into a mathematical space in which computer systems can compare pieces of information to each other. These vectors are then stored in a vector database for fast retrieval. Variations like GraphRAG store the information in a knowledge graph.
  • The retriever is used to find the top k information chunks most likely to be useful in responding to the prompt. The prompt is vectorized and a nearest neighbor search is performed by the retriever, sometimes followed by a ranking of the retrieved information chunks. These information chunks are then sent to the language model alongside the prompt.
  • The language model is used to respond to the prompt with the provided chunks of information and formulates a response that is grounded in the retrieved information, usually providing the relevant chunks as sources. In some cases, language models are used to rephrase the initial prompt into an optimized format for the retriever. A minimal code sketch of how these components fit together follows this list.
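
To make the interplay of these components concrete, here is a minimal sketch in Python. It is only an illustration: TF-IDF vectors from scikit-learn stand in for a dedicated embedding model, an in-memory similarity search stands in for a vector database, and the document chunks are made up.

```python
# Minimal sketch of the three RAG components: knowledge base, retriever, and language model.
# TF-IDF vectors and an in-memory nearest-neighbor search stand in for a real embedding
# model and vector database, so the example runs without any external services.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Knowledge base: chunked documents (tiny, made-up chunks for illustration).
chunks = [
    "Disease X is treated with therapy A at most university hospitals.",
    "A 2024 study found therapy B effective for early-stage Disease X.",
    "Therapy A requires specialized equipment and trained staff.",
]

# Vectorize/embed the chunks, i.e. build the "vector store".
vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)

# 2. Retriever: embed the prompt and return the top-k most similar chunks.
def retrieve(query: str, top_k: int = 2) -> list[str]:
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    top_indices = scores.argsort()[::-1][:top_k]
    return [chunks[i] for i in top_indices]

# 3. Language model: the retrieved chunks are sent alongside the prompt.
query = "How is Disease X treated?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nAnswer the question using only the context above.\nQuestion: {query}"
print(prompt)  # this grounded prompt would be sent to the language model of your choice
```

In a production system, the vectorizer would be replaced by an embedding model and the similarity search by a vector database, but the flow of chunking, embedding, retrieving, and prompting stays the same.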

What are example use cases for RAG?

Our team at DesignMind has successfully implemented several RAG applications for our clients. One of them is a natural language to SQL tool that responds with valid SQL code tailored to the client's databases. Check out this video from my colleague Rob Kerr for more details.

 

We also built a GraphRAG system that compares complex documents to a gold standard document. It then reports the differences between them and highlights potentially disadvantageous parts. Additionally, we built some agentic systems for different analytical tasks. These agentic systems can explain BI-report screenshots, provide information on how certain KPIs are calculated, solve ad-hoc data analysis requests, and automate report-to-report comparisons. We explain AI agents in this recent blog post: Agentic AI and AutoGen: How Agents can help you.

Some other examples are:

  • A personal AI assistant that makes it easy to find and summarize relevant information on specific topics across your whole system, using all your work documents and e-mails as its knowledge base.
  • Enhanced AI code assistants with your codebase as the knowledge base.
  • Education coaches that tailor learning material to the user’s knowledge and learning style.
  • Research assistants that source information from a vast selection of scientific papers.

Advantages and Disadvantages of RAG

There are several advantages and disadvantages of RAG applications compared to other LLM improvement options like prompt engineering (limited effectiveness) and fine-tuning (a computationally heavy task). RAG incorporates prompt engineering techniques and can also be combined with model fine-tuning.

Disadvantages

  • Higher Complexity: Since there are more components involved, RAG applications tend to be a little more complex than fine-tuned systems.
  • Longer Response Times: The additional components lead to longer response times, especially if a large knowledge base is used. This can be improved by using a faster model.
  • Quality Dependence: The response quality is highly dependent on the quality of the knowledge base and the retriever.
  • Lacks Tone and Format Adaptability: Compared to fine-tuning, the adaptability of tone and output format is limited in RAG applications.

Advantages

  • Better Accuracy: By providing the LLM with the necessary knowledge to respond to the prompt, the accuracy for specific use cases can be more than doubled. Grounding the model with a curated knowledge base of choice like Wikipedia can also improve general accuracy.
  • Fewer Hallucinations: Since the LLM is usually limited to information that is retrieved from the knowledge base, it is much less likely to hallucinate.
  • Highly adaptable: A company chatbot, a cooking companion, or a research assistant with highly specific knowledge about the latest scientific papers are all possible with RAG.
  • Up-to-date Data: Updating the knowledge base with up-to-date information is relatively easy, especially compared to frequently fine-tuning a language model.
  • Up-to-date Technology: The performance of language models is increasing incredibly fast (see Figure 2). Luckily, the model that is used in a system can be switched to the most advanced in no time.
  • Cost-effective: Compared to frequent fine-tuning, RAG systems are much cheaper, and since the knowledge is stored in the knowledge base, a much cheaper language model will produce perfectly fine responses for most use cases.
  • More Trust: Getting to see the sources used to create the response builds trust in the system when using a reputable knowledge base.

Figure 2: Performance Increase of LLMs since March 2023 (Source: https://llm-stats.com/)

How to set up a RAG Application

Depending on your current infrastructure and use case, there are some things to consider before setting up a RAG application. Fortunately, every big cloud service provider offers all the necessary components and more. If you have an on-premises system that can run language models, you can even set up an offline version if needed.

An example workflow (regardless of your infrastructure) could be:

  • Collect your Data: The first step is to collect the data that you want to use as your knowledge base. This could be all your internal documents and communication, Wikipedia, technical documentation, database schemas and documentation, or scientific papers.
  • Set up a Knowledge Store: Next, you need a knowledge store to store your data. Some options are Microsoft Azure’s AI Search and Pinecone for managed vector databases, Chroma and Weaviate for open-source vector databases, or Microsoft’s GraphRAG and Neo4j for knowledge graphs (which work better for complex and inter-connected data).
  • Prepare your Data: Your data needs to be chunked and vectorized to be stored in a vector store. All of the aforementioned options offer at least one chunking method, but you can also use services like io or Microsoft Azure’s Document Intelligence to divide your texts into semantic chunks. These chunks then have to be vectorized using either an embedding model like OpenAI’s text-embedding-3-small or NLP methods like TF-IDF Vectorization from scikit-learn. After vectorizing the chunks, we put them into the vector store and create the vector index.
  • Pick a Language Model: Select a suitable language model for your use case. This can be anything from proprietary models like OpenAI’s o1 or Anthropic’s Claude 3.5 Sonnet, to open-weight models from HuggingFace, or even Small Language Models like Microsoft’s Phi-4.
  • Set up the System: Now everything comes together with an LLM call that includes the retriever and a sophisticated system prompt that restricts the model to creating its response, in a suitable format, based only on the retrieved chunks. An example system prompt could be the following (a minimal code sketch of this step appears right after it):
    “Context: ||{context_str}||
    Using the above context encapsulated in ||, answer the query in a concise format with a maximum of 50 words. Provide the context you used as a source.
    Use only the context provided to answer the questions. If the context does not answer the question, answer that you don’t know the answer and suggest asking differently.
    Query: {query_str}
    Answer: ”
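
As a rough illustration of this last step, here is a minimal sketch that wires the pieces together with Chroma (one of the open-source vector stores mentioned above) and OpenAI's chat API. The chunk texts, collection name, and model name are placeholders, and Chroma's built-in default embedding function stands in for a dedicated embedding model:

```python
# Minimal sketch of "Set up the System": chunks go into a Chroma collection, the retriever
# pulls the closest chunks for a query, and the system prompt above grounds the LLM call.
# Assumes `pip install chromadb openai` and an OPENAI_API_KEY in the environment;
# chunk texts, collection name, and model name are illustrative placeholders.
import chromadb
from openai import OpenAI

SYSTEM_PROMPT = (
    "Context: ||{context_str}||\n"
    "Using the above context encapsulated in ||, answer the query in a concise format "
    "with a maximum of 50 words. Provide the context you used as a source.\n"
    "Use only the context provided to answer the questions. If the context does not answer "
    "the question, answer that you don't know the answer and suggest asking differently."
)

# Knowledge store: Chroma embeds the chunks with its default embedding function.
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="knowledge_base")
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "Therapy A is offered at our main clinic and requires a referral.",
        "Therapy B was approved in 2024 and is available at partner hospitals.",
    ],
)

def answer(query: str, top_k: int = 2) -> str:
    # Retriever: nearest-neighbor search over the stored chunks.
    hits = collection.query(query_texts=[query], n_results=top_k)
    context_str = "\n".join(hits["documents"][0])

    # Language model: respond to the query, grounded in the retrieved chunks.
    llm = OpenAI()
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # swap in whichever model fits your use case
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(context_str=context_str)},
            {"role": "user", "content": f"Query: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("Which therapies do we offer in-house?"))
```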

Congratulations, you just set up a basic RAG application! The next step should be to develop an evaluation workflow that enables you to improve your system further. For improvements, you can try different chunking and embedding methods or vector stores, optimize the retriever, introduce semantic ranking, test different language models, or improve the system prompt.
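
For the evaluation workflow, one simple starting point is a small, hand-labeled set of questions paired with the chunk that should answer each of them, plus a retrieval hit-rate metric you re-run after every change. The labeled pairs and the stub retriever in this sketch are placeholders for your real data and retriever:

```python
# Sketch of a simple retrieval evaluation: how often does the expected chunk appear
# in the top-k retrieved results (hit rate)? Re-run this after every change to chunking,
# embeddings, ranking, or prompts to see whether the system actually improved.
labeled_examples = [
    {"question": "How is Disease X treated?", "expected_chunk_id": "chunk-1"},
    {"question": "Which therapy was approved in 2024?", "expected_chunk_id": "chunk-2"},
]

def retrieve_ids(question: str, top_k: int = 3) -> list[str]:
    """Placeholder retriever: replace with a query against your actual vector store."""
    return ["chunk-1", "chunk-3", "chunk-2"][:top_k]

def hit_rate(examples: list[dict], top_k: int = 3) -> float:
    """Share of questions whose expected chunk shows up in the top-k retrieved results."""
    hits = sum(
        1 for ex in examples
        if ex["expected_chunk_id"] in retrieve_ids(ex["question"], top_k=top_k)
    )
    return hits / len(examples)

print(f"Retrieval hit rate: {hit_rate(labeled_examples):.2f}")
```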

Now you have a clear overview of how Retrieval-Augmented Generation can enhance a wide range of use cases, and of the many challenges it can solve in today's world.

Jeremy Schieblon is a Senior Data Science Consultant at DesignMind. Formerly with the German InsurTech company Eucon, he specializes in transparent Machine Learning using Azure Machine Learning and Python.