bbv KI Webinar - Agent-RAG

Webinar: Agent-RAG

3.7.2024, ca. 20 participants

In my most recent webinar, my colleague Martin König, our moderator Stefan Heberling and I discussed how Retrieval Augmented Generation (RAG) works in practice. We wanted to show the technical details behind this approach and why it is so valuable for so many applications. The main focus was on the interplay between the pipelines that allow an AI model to draw on company-specific or publicly available data and use it intelligently.


Martin opened the webinar by noting that we would be using more technical terminology than usual and would therefore spend time on basic explanations at the start. This covered above all the Large Language Model (LLM), which is at the heart of every AI agent, and the concept of the system prompt, which defines how the model's responses are formulated. We also made clear that the query (the user's question) and the additional context from a knowledge database must work together to deliver a comprehensive answer.

"A Large Language Model is the core of an AI agent: it generates continuous text by meaningfully extending what has already been said, forming the basis for every generative application."

Overview of the Pipelines

The core of RAG can be broadly divided into three sections. These so-called pipelines determine how information is ingested, prepared and ultimately used. The first pipeline, the Ingestion Pipeline, is concerned with selecting documents from various sources, cleaning them and structuring them. These data are then stored in a knowledge database that can make them quickly available again later.

The second pipeline, the Query Pipeline, handles everything that happens when a user asks a question. It examines what the user actually wants and how their query might be reformulated to yield more relevant results. The third pipeline, the Retrieval Pipeline, then searches the knowledge database for the appropriate answers by comparing the vector of the user's question with the vectors of the chunks. The most similar chunks are initially selected because they match best in terms of content, and together with the original question they are processed into a fluent, comprehensible text.

The Ingestion Pipeline

The first step in a RAG system involves reading and processing data before it goes into a knowledge database. During this step a decision is made about which documents to include at all and whether access permissions need to be taken into account. If internal policies prohibit certain files from being shared with an AI system, that content can be filtered out early on.
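The early filtering step can be sketched as follows. This is a minimal illustration, not the setup shown in the webinar: the `SourceDocument` type, the `ai_sharing_allowed` flag and the sample files are hypothetical and stand in for whatever policy information an organisation actually maintains.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDocument:
    path: str
    text: str
    # Hypothetical policy flag: may this file be shared with the AI system?
    ai_sharing_allowed: bool = True
    # Hypothetical access groups, kept so they can be stored as metadata later.
    allowed_groups: set[str] = field(default_factory=set)


def select_for_ingestion(docs: list[SourceDocument]) -> list[SourceDocument]:
    """Drop documents that policy excludes before any further processing."""
    return [doc for doc in docs if doc.ai_sharing_allowed]


# Made-up sample data for illustration only.
docs = [
    SourceDocument("handbook.md", "Vacation rules", True, {"all-staff"}),
    SourceDocument("salaries.xlsx", "Confidential figures", False, {"hr"}),
]
selected = select_for_ingestion(docs)
print([doc.path for doc in selected])
```

Filtering before chunking and vectorising means excluded content never reaches the knowledge database in the first place.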

Once the right data have been selected, they are split into small sections. This process is usually referred to as chunking and its purpose is to divide knowledge in such a way that the model can later retrieve exactly the right excerpts. A simple approach is to split into fixed lengths, say every few thousand words. Often, however, it makes more sense to preserve the structure of a text and cut at headings or thematic sections. This ensures that a section remains coherent and does not mix with entirely different topics.
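The two splitting strategies can be compared in a short sketch. It assumes, purely for illustration, that the text uses Markdown-style headings starting with `#`; the function names and the sample text are hypothetical.

```python
def split_fixed(text: str, max_words: int = 200) -> list[str]:
    """Fixed-length chunking: cut every max_words words, ignoring structure."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def split_at_headings(text: str) -> list[str]:
    """Structure-aware chunking: start a new chunk at every heading line."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # Assumption: a heading is any line that starts with '#'.
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


doc = "# Vacation\nEmployees get 25 days.\n\n# Expenses\nSubmit receipts monthly."
fixed_chunks = split_fixed(doc, max_words=5)
heading_chunks = split_at_headings(doc)
print(fixed_chunks)
print(heading_chunks)
```

With the small word limit used here, the second fixed-length chunk mixes the end of the vacation section with the start of the expenses section, while the heading-based split keeps each topic together.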

"A perfectly crafted text splitter divides the document the way we as humans would: at headings, at coherent paragraphs or at clearly delineated text boxes."

In the webinar we also pointed out that standard tools often do not know which elements of a document are superfluous. They sometimes include navigation menus, links and captions in awkward places, producing fragments of limited usefulness. A tailored solution that recognises exactly which content to use and splits it sensibly into chunks can remedy this.
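What such a tailored cleaning step might look like for HTML sources is sketched below, using only Python's standard `html.parser` module. The list of tags to skip is an assumption chosen for this example; a real project would adapt it to its own document types.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is treated as superfluous.
SKIP_TAGS = {"nav", "header", "footer", "figcaption", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text while ignoring everything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Vacation policy</h1><p>Employees get 25 days.</p>"
    "<figcaption>Figure 1</figcaption>"
)
clean_text = extract_main_text(page)
print(clean_text)
```

The navigation link and the caption are dropped, so only the heading and the paragraph are passed on to chunking.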

Once the sections are created and cleaned, each one can be converted into a numerical vector, a so-called embedding, using a language model. These vectors are then stored in the knowledge database and enable semantic search. Metadata such as categories, access permissions or timestamps can supplement the content. This simplifies handling different document versions later and prevents confidential information from reaching unauthorised users.
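The sketch below shows how a chunk, its vector and its metadata could be stored together and filtered by access permission. Note the assumptions: `toy_embed` is a hashed bag-of-words stand-in, not a real language model, and `ChunkRecord`, `build_record`, `visible_to` and the sample values are hypothetical names and data invented for this example.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # Arbitrary vector size for the toy embedding.


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: hashed word counts, length-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class ChunkRecord:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def build_record(text: str, **metadata) -> ChunkRecord:
    return ChunkRecord(text, toy_embed(text), metadata)


def visible_to(records: list[ChunkRecord], user_groups: set[str]) -> list[ChunkRecord]:
    """Keep only chunks whose allowed_groups overlap with the user's groups."""
    return [r for r in records if r.metadata.get("allowed_groups", set()) & user_groups]


# Made-up sample data for illustration only.
records = [
    build_record("Employees get 25 vacation days.", category="hr-policy",
                 allowed_groups={"all-staff"}, updated="2024-07-03"),
    build_record("Salary bands for 2024.", category="hr-confidential",
                 allowed_groups={"hr"}, updated="2024-07-03"),
]
visible = visible_to(records, {"all-staff"})
print([r.text for r in visible])
```

Because the permission check runs on metadata, it can be applied before any similarity search, so restricted chunks never become candidates for an answer.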

The Query Pipeline

As soon as a user asks a question, the Query Pipeline springs into action. An important topic here is intent recognition. We want to understand what lies behind the question. Has a document been uploaded and simply needs to be summarised? Or does someone want to search sources stored in the company network? One simple approach is to feed a language model with example sentences and let it decide which category the request falls into. More complex scenarios can be handled using a dedicated classification model.
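The simple approach with example sentences could look roughly like this. The intent labels, the example sentences and the prompt wording are assumptions made for illustration, and `fake_llm` is a keyword stub that replaces a real model call so the sketch runs on its own.

```python
from typing import Callable

# Hypothetical intent categories, based on the two cases mentioned above.
INTENTS = ["summarise_upload", "search_company_sources"]

FEW_SHOT = [
    ("Please summarise the PDF I just uploaded.", "summarise_upload"),
    ("What does our travel policy say about train tickets?", "search_company_sources"),
]


def build_intent_prompt(question: str) -> str:
    lines = ["Classify the request into one of: " + ", ".join(INTENTS) + ".", ""]
    for example, label in FEW_SHOT:
        lines += [f"Request: {example}", f"Intent: {label}", ""]
    lines += [f"Request: {question}", "Intent:"]
    return "\n".join(lines)


def classify_intent(question: str, llm: Callable[[str], str],
                    default: str = "search_company_sources") -> str:
    answer = llm(build_intent_prompt(question)).strip()
    # Fall back to a default if the model answers with an unknown label.
    return answer if answer in INTENTS else default


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real language model call."""
    last_request = prompt.rsplit("Request:", 1)[1]
    return "summarise_upload" if "upload" in last_request.lower() else "search_company_sources"


intent = classify_intent("Can you summarise the uploaded report?", fake_llm)
print(intent)
```

Validating the model's answer against the known labels keeps an unexpected reply from sending the request down a pipeline that does not exist.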

In some cases it also makes sense to rewrite the query, the user's actual question. This so-called query rewriting can turn a very narrow question into a more comprehensive version. In the webinar we discussed, for example, the technique called Hypothetical Document Embeddings (HyDE). Here, a hypothetical answer is generated from the question and then vectorised in order to better match the existing documents. At the final output stage, the user naturally receives the answer to their original question; only in the background do we use a rewritten version for the database search.
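The idea of searching with a hypothetical answer instead of the question can be sketched in a few lines. Both `fake_generate` and `toy_embed` are stubs invented for this example; in a real system they would be calls to a generative model and an embedding model, and the prompt wording is an assumption as well.

```python
from typing import Callable

# Tiny vocabulary for the toy embedding, chosen for this example only.
VOCAB = ["vacation", "days", "employees", "receive"]


def toy_embed(text: str) -> list[float]:
    """Stub embedding: counts how often each vocabulary word occurs."""
    words = text.lower().replace(".", "").replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]


def fake_generate(prompt: str) -> str:
    """Stub for the generative model that writes the hypothetical answer."""
    return "Employees receive 25 vacation days per year."


def hyde_search_vector(question: str,
                       generate: Callable[[str], str],
                       embed: Callable[[str], list[float]]) -> list[float]:
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {question}\nPassage:"
    )
    hypothetical_answer = generate(prompt)
    # The vector of the hypothetical answer is used for the database search.
    return embed(hypothetical_answer)


question = "How much holiday do I get?"
search_vector = hyde_search_vector(question, fake_generate, toy_embed)
print(toy_embed(question), search_vector)
```

In this toy setup the question shares no vocabulary with the stored wording, while the hypothetical answer does, which is the effect the technique relies on. The original question is kept unchanged for the final answer generation.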

The Retrieval Pipeline

The final step decides how the actual answer is generated. The Retrieval Pipeline searches the knowledge database for the appropriate text passages by comparing the vector of the user's question with the vectors of the chunks. The most similar chunks are initially selected because they match best in terms of content. If there are too many possible hits, the list can be refined further using so-called re-ranking methods. A language model then evaluates, for example, which chunks are truly most relevant and reorders them.
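The vector comparison and the re-ranking step can be sketched as follows. Cosine similarity is used here as one common similarity measure; the two-dimensional vectors are made up, and `word_overlap` is a deliberately simple stand-in for the language model that would judge relevance in a real re-ranker.

```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]],
          k: int = 3) -> list[tuple[float, str]]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def rerank(question: str, candidates: list[str],
           score_fn: Callable[[str, str], float]) -> list[str]:
    """Reorder candidates by a second, more expensive relevance score."""
    return sorted(candidates, key=lambda text: score_fn(question, text), reverse=True)


def word_overlap(question: str, text: str) -> float:
    """Toy relevance score standing in for a language model judgement."""
    return float(len(set(question.lower().split()) & set(text.lower().split())))


# Made-up chunks with made-up 2-D vectors, for illustration only.
chunks = [
    ("vacation days per year", [0.9, 0.1]),
    ("expense report rules", [0.1, 0.9]),
    ("sick leave rules", [0.7, 0.3]),
]
best = [text for _, text in top_k([1.0, 0.0], chunks, k=2)]
reranked = rerank("how many vacation days", ["sick leave rules", "vacation days per year"],
                  word_overlap)
print(best)
print(reranked)
```

The first stage is cheap and runs over all chunks; the second stage only sees the short candidate list, which is why a slower scoring method is affordable there.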

"Retrieval means finding the right text passages using a query in the vector space by determining the semantic distance and filtering out which ones are closest in meaning."

After retrieval, the system can limit the volume of incoming text pieces to avoid overloading the generative language model with too much context. Too few chunks are equally problematic, however, because relevant information may then be missing. In the end all selected chunks are sent together with the original question and a system prompt to the large language model. These three elements, system prompt, context and query, ultimately result in a coherent text as the response.
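How the three elements come together can be sketched as follows. The word-based budget is a simplification (real systems usually count tokens), and the prompt layout, function names and sample texts are assumptions made for this example.

```python
def select_within_budget(ranked_chunks: list[str], max_words: int) -> list[str]:
    """Take chunks in relevance order until the context budget is used up."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:  # Assumed to be sorted by relevance already.
        size = len(chunk.split())
        if used + size > max_words:
            break
        selected.append(chunk)
        used += size
    return selected


def build_prompt(system_prompt: str, chunks: list[str], question: str) -> str:
    """Combine system prompt, context and query into one model input."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


# Made-up sample data for illustration only.
ranked = [
    "Employees get 25 vacation days.",
    "Vacation must be approved by the team lead.",
    "Expenses are submitted monthly.",
]
context_chunks = select_within_budget(ranked, max_words=13)
prompt = build_prompt(
    "Answer only from the context below.",
    context_chunks,
    "How many vacation days do I get?",
)
print(prompt)
```

With a budget of 13 words the first two chunks fit and the third is left out, which illustrates the trade-off described above: a tighter budget protects the model from overload but risks dropping relevant passages.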

Hallucination and Sources of Error

In the webinar we were often asked whether a RAG system can hallucinate. Strictly speaking, any language generator can invent facts or mix up details when it lacks the right context. RAG does, however, reduce the likelihood of such errors. The AI bases its answers on external and generally verified content from the knowledge database. At the same time, it can still draw on its general language knowledge, which in rare cases can lead to incorrect additions. The key advantage is that the model works with queryable sources rather than spinning everything out of its own "memory".

When a document is very extensive, another problem can arise: the model receives only excerpts. Anyone expecting a global summary of the entire content runs up against the context limit. Planning a RAG system therefore requires thinking about whether a user needs the complete document or only individual details. Modern language models offer longer context windows, which partially resolves this problem and allows more extensive analyses when needed.

Conclusion and Outlook

In summary, the webinar showed how Retrieval Augmented Generation works in practice. It is not only a clever method for making vast amounts of data manageable; it is also an opportunity to sustainably improve internal research within organisations. Once you have experienced how quickly an AI system delivers a precise answer, you will rarely go back to rigid file structures, lengthy full-text searches or manual scrolling.

Setting up a RAG system does require expertise. The correct configuration of the ingestion pipeline, the integration of access permissions and metadata and the fine-tuning of query and retrieval often impose special requirements. In the next webinar we will focus more on the business-oriented side of generative AI solutions. We want to answer questions such as: "What distinguishes Microsoft Copilot from ChatGPT?" and "When do you additionally need an AI solution like the AI Hub?" This is intended to provide orientation for those who have to make decisions about introducing AI systems without needing to know every technical detail.

"Once you have experienced how efficiently a RAG system delivers precise answers, you will not want to return to traditional research methods."

With that, the webinar concluded and the most important questions about RAG were addressed. I thank everyone who took part and look forward to the next time, when we will go deeper into business perspectives and strategic considerations.
