bbv KI Webinar - Data Flows

Webinar: Data Flows in AI Systems

28.2.2024, ca. 40 participants

I held a webinar that took an intensive look at data flows in AI systems. The central question was how large language models, such as GPT, actually work and exactly where we as users have the decisive lever to guarantee the quality and security of responses. My colleague Allen moderated the evening, introduced our previous topics in the webinar series and emphasised from the start that we wanted to take "a look under the hood".

It was important to me to show participants that AI is not magic and does not generate knowledge "out of thin air" on its own. The technology is based on statistical methods and mathematical models that only prove helpful when we supply them with the right information. At the same time I wanted to make understandable why we should keep a close eye on so-called "data flows" as soon as we integrate AI models into our workflows.

"Generative AI is not a magic trick. It only works through interaction and the deliberate control of the underlying data."

This statement ran as a connecting thread through the entire webinar. I wanted to emphasise that every user directly influences the outcome the moment they ask questions or transmit information.

Three Essential Sources for the Model

To clarify what is meant by "data flows", I explained which sources supply a large language model like GPT with information. First, there is the knowledge we provide directly when asking a question. Second, company-specific databases can be integrated to provide additional or current knowledge. And third, there is the so-called model knowledge, the statistical relationships that have become embedded in the model from its training data.

I described how important it is for many organisations to be able to connect their internal expertise, such as documents, manuals or guidelines, in the form of a database. Especially when the model's standard answer falls short or perhaps even contains outdated facts, this approach yields significantly higher accuracy and relevance. Otherwise the answers rest solely on the model's training material, which is enormous but may be out of date or poorly matched to the use case.

"The model brings its own 'general knowledge', but it only becomes truly precise when we make clear what is relevant to our specific case."

I noted that this second data source, company-specific knowledge, is our most important supplement to the model's more global knowledge. How best to integrate it depends, among other things, on the software architecture used and the data protection requirements.
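As a minimal sketch of how the first two sources come together, the following Python snippet combines a user question with company passages into a single prompt. The names `PromptInputs` and `build_prompt` and the prompt wording are assumptions chosen for illustration, not a specific product's API. The third source, the model knowledge, does not appear in the prompt at all because it is already contained in the trained model.

```python
from dataclasses import dataclass, field


@dataclass
class PromptInputs:
    """Hypothetical container for the data that flows into one request."""
    question: str  # source 1: what the user supplies directly
    company_passages: list[str] = field(default_factory=list)  # source 2


def build_prompt(inputs: PromptInputs) -> str:
    """Join company passages (if any) and the question into one prompt.

    Source 3, the model knowledge, is not part of the prompt: it is
    already embedded in the trained model.
    """
    parts = []
    if inputs.company_passages:
        parts.append("Use the following company information:")
        parts.extend(f"- {passage}" for passage in inputs.company_passages)
    parts.append(f"Question: {inputs.question}")
    return "\n".join(parts)


example = PromptInputs(
    question="What is our travel policy?",
    company_passages=["Flights must be booked 14 days ahead."],
)
print(build_prompt(example))
```

Without any company passages the prompt consists of the question alone, and the answer then rests entirely on the model knowledge.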

How Context Is Formed

The webinar also covered how we can help the model formulate appropriate answers. I spoke about "Retrieval-Augmented Generation". Behind this term is the idea of not simply sending a question to the model but first deliberately retrieving information from a knowledge database and attaching it to the model as context. This way, not all possible documents are given to the system indiscriminately, only those that are truly relevant.

This approach reduces the risk of incorrect or contradictory statements. Above all it supports traceability. If the question arises afterwards as to why the AI reached a particular result, one can easily see which text passages served as the basis.

"Instead of confusing the model, we give it an exact selection of relevant content. This keeps the answer more precise and consistent."
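The retrieval step can be sketched roughly as follows. This is a simplified illustration under stated assumptions: a real setup would typically use an embedding model and a vector database, whereas this sketch substitutes a plain word-count cosine similarity so that it runs without external services. The document IDs, texts and function names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Stand-in for an embedding: lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: dict[str, str], top_k: int = 2):
    """Return up to top_k (doc_id, text) pairs that overlap with the question."""
    q = vectorise(question)
    scored = [(cosine(q, vectorise(text)), doc_id) for doc_id, text in documents.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, documents[doc_id]) for _, doc_id in scored[:top_k]]


def build_rag_prompt(question: str, hits) -> str:
    """Attach the retrieved passages, labelled with their IDs, as context."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


documents = {  # hypothetical knowledge base
    "hr-01": "Vacation requests must be submitted two weeks in advance.",
    "it-07": "Reset your password through the self-service portal.",
    "fin-03": "Expense reports are due at the end of each month.",
}
question = "How do I reset my password?"
hits = retrieve(question, documents)
print(build_rag_prompt(question, hits))
```

Because each passage carries its document ID into the prompt, it remains visible afterwards which texts the answer was based on.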

I explained using an example that a language model responding to the query "What does it mean when someone is on the bench?" can go in very different directions. The example comes from German, where the word "Bank" means both a bench and a bank. Is the model thinking of a park bench, a financial institution, or the sporting context? Anyone who clarifies the appropriate background information from the outset will generally receive considerably more precise results.

Agents as Supporting Actors

In the course of the webinar I showed that a third component can also come into play: software agents that take on different functions. One agent could be specialised in retrieving financial data, while another handles sales questions. These small helpers orchestrate how the large language model should respond to certain inputs.

Many participants were particularly surprised at this point by how many processes can be automated or predefined before the actual language model encounters a question. I described how this ensures that not every person has to type the perfect commands ("prompts") themselves, but that certain steps are automatically handled by agents.

"An agent can retrieve the passages, prepare them sensibly and if necessary go through several intermediate steps before querying the model."
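One way to picture such agents is a small router that picks a specialised helper, lets it prepare context, and only then queries the model. The sketch below is an assumption-laden illustration: the keyword lists, the agent functions and the `ask_model` callable are hypothetical placeholders, and real agent frameworks usually let the model itself decide on the routing.

```python
import re
from typing import Callable


def finance_agent(question: str) -> str:
    # Hypothetical: a real agent would query the finance system here.
    return "Finance context: (figures retrieved from the finance system)"


def sales_agent(question: str) -> str:
    # Hypothetical: a real agent would query the CRM here.
    return "Sales context: (records retrieved from the CRM)"


# Agent name -> (trigger keywords, function that prepares context)
AGENTS = {
    "finance": (("revenue", "budget", "invoice"), finance_agent),
    "sales": (("customer", "offer", "deal"), sales_agent),
}


def route(question: str) -> str:
    """Pick the first agent whose keywords appear in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for name, (keywords, _) in AGENTS.items():
        if words & set(keywords):
            return name
    return "general"


def answer(question: str, ask_model: Callable[[str], str]) -> str:
    """Let the chosen agent prepare context, then query the model."""
    name = route(question)
    if name == "general":
        prompt = question
    else:
        context = AGENTS[name][1](question)
        prompt = f"{context}\n\nQuestion: {question}"
    return ask_model(prompt)


# The model call is stubbed with an echo function so the sketch runs offline.
print(answer("What was our revenue last quarter?", ask_model=lambda p: p))
```

The user only types the question; choosing the data source and assembling the prompt happen before the model is involved.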

My experience shows that this can be a blessing in complex environments such as large organisations. Users save time, get consistent results and do not have to operate the full range of AI controls by hand every time.

Handling Sensitive Data

With all the enthusiasm for the topic, limits and risks were also addressed. I made clear that a language model does not "continue learning" during use; it does not maintain a growing database in the background that automatically stores our inputs. Nevertheless, many providers store conversation histories to improve the system later or to train new models. This is a question of company policy and contractual arrangements.

An important point was the data protection perspective. Anyone who shares personal information with an external provider must carefully check which data is leaving the server and whether it should perhaps be anonymised. I explained that architectural solutions exist that process sensitive data in a separate environment before passing it on to the large model.

"Sensitive data should not simply be dumped into a publicly accessible AI system without thinking. Technically this can often be solved, but it must be approached consciously."
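A separate pre-processing step of this kind could look roughly like the following sketch, which replaces e-mail addresses and phone numbers with placeholders before a text leaves the internal environment and restores them afterwards. The regular expressions and function names are illustrative assumptions. They cover only simple patterns and would not catch names or other personal details, so this is not a complete anonymisation solution.

```python
import re

# Deliberately simple patterns, assumed for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s/-]{7,}\d")


def anonymise(text: str) -> tuple[str, dict[str, str]]:
    """Replace e-mail addresses and phone numbers with placeholders.

    Returns the redacted text and a mapping that stays on the internal
    side so the original values can be restored later.
    """
    mapping: dict[str, str] = {}

    def replacer(kind: str):
        def _sub(match: re.Match) -> str:
            placeholder = f"<{kind}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        return _sub

    text = EMAIL.sub(replacer("EMAIL"), text)
    text = PHONE.sub(replacer("PHONE"), text)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into a text that uses the placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


original = "Contact Anna at anna.meier@example.com or +41 44 123 45 67."
redacted, mapping = anonymise(original)
print(redacted)  # this version would be sent to the external provider
print(restore(redacted, mapping))
```

Only the redacted text is passed on, while the mapping never leaves the separate environment.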

I made clear that an AI system like GPT does not guarantee error-free output. It can "hallucinate" when context is wrong or missing. A systematic approach to sources and a final check by humans remain essential in many cases. It was important to me to communicate this clearly so as not to raise false expectations.

Outlook and Closing Thoughts

After I had covered the various aspects, from integrating a vector database through agent concepts to security, Allen concluded by previewing the next webinar in our series. On 11 April, Emre will guide us into the world of business applications of AI. The focus will be on how to develop a viable AI product from a good idea and which strategies have proven themselves.

I am personally looking forward to this contribution because it connects the technology with real business practice. Often the technical feasibility is already there, but creating value and implementing the solution successfully in everyday business are challenges of their own.

"Successful AI projects emerge at the intersection of a good idea, efficient use of data and realistic implementation strategies."

At the end of the webinar there was a window for questions. Many participants apparently found the connections very clear, as no specific follow-up questions came up. Allen joked that we were almost a little disappointed, but silence usually means the message was understood.

With that I closed the evening. My goal was to show how a language model really "ticks" in practice and why we should keep a close eye on all data flows. I am convinced that a sound understanding of this process is the key to using AI solutions effectively and responsibly. I thank everyone who attended the webinar and look forward to continuing to develop this knowledge together.
