
How to Finally Get AI Assistants That Understand Your Problems and Do the Work for You?

Explore advancements in AI assistant technology for 2024, focusing on the critical importance of context and innovative techniques like fine-tuning and Retrieval Augmented Generation (RAG). Discover how these developments can transform AI into specialized solutions tailored to specific needs.

Apr 16, 2024

Read time: ~9 min.

They say that 2024 is going to be the year when AI assistants go mainstream. Some hail them as the holy grail of AI automation, a new way of computing where you can speak to your computer Star Trek style. But as the first examples of such experiences appear, with Open Interpreter, Rabbit R1 and the Humane Ai Pin, people are quick to note their limitations. Much like ChatGPT, most of the time they are generic in their output and their actions (if they are capable of performing actions at all). So whenever you want to use these tools for something actually useful, you need to have a very long conversation about what you need and how they should behave. Very often you need to feed them several documents or books before they understand the premise.

Which is exactly the real problem with Generative AI these days: it’s just too generic. Quite often generic works, but only for a proof of concept or as a starting point, never for the finishing touches of whatever you want to accomplish. This is the reason why humans still don’t trust AI. Software engineers, for instance, have learned to use ChatGPT and GitHub Copilot, yet they laugh at the very concept of having them create a fully working application. The recent viral marketing campaign of Devin AI promised a fully working AI software development assistant, only to be quickly debunked by the global software development community as a well-scripted demo that shows Devin working on files in open-source libraries that don’t exist at all.

This whole experience has led many to dismiss the very idea of genuinely useful AI assistants. Yet many more still believe this technology is just about to emerge in 2024. Why is that? As you will soon find out, the building blocks are already here, and albeit a bit crude and complex to set up, they do allow you to have genuinely useful AI assistants that are aware of the actual context of your specific problem.

So, what exactly is context?

Context, in the world of AI and LLMs, is the LLM’s understanding of the premise of the prompt. It’s a combination of the training data that was used to train the LLM and the information contained in the actual prompt. For instance, if you take a simple chat application like ChatGPT, the message that initiates your conversation is not the initial text of the conversation. ChatGPT as an application first provides a general instruction to the LLM that is hidden from you as a user. It contains text along the lines of “You are a helpful AI assistant. You will be respectful and useful to the user, you will be politically correct, you will not promote illicit activities, you will not use bad words, etc.” This initial prompt, combined with the LLM’s knowledge of what conversations should look like from its training data, is what makes ChatGPT conversations behave the way you already know.
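As a rough illustration (the real hidden instructions are not public, so the wording below is invented), a chat application might assemble what it sends to the LLM roughly like this, here using the OpenAI Python client:

```python
# Minimal sketch: a chat app prepends a hidden system prompt to the user's message.
# The instruction text is invented for illustration; ChatGPT's real prompt is not public.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a helpful AI assistant. Be respectful and useful to the user, "
    "do not promote illicit activities, and do not use offensive language."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},          # hidden from the user
        {"role": "user", "content": "What is a context window?"},  # what the user typed
    ],
)
print(response.choices[0].message.content)
```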

A typical LLM chat implementation actually sends the whole history of the conversation, all the previous messages you’ve exchanged, plus the initial prompt to the LLM every time a new reply is generated. That’s why, when you have long conversations with LLMs, they often forget the beginning of the conversation. Their memory buffer, also technically called the context, has overfilled and the initial part of the conversation has been truncated.
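A minimal sketch of that behaviour, with a made-up message budget standing in for a real token limit:

```python
# Sketch of a chat loop: the full history is resent on every turn, and the oldest
# messages are dropped once a (made-up) budget is exceeded - which is why the model
# "forgets" the beginning of long conversations.
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are a helpful AI assistant."
MAX_MESSAGES = 20  # stand-in for a real token-based limit

history = [{"role": "system", "content": SYSTEM_PROMPT}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # Truncate from the front (keeping the system prompt) when the buffer overfills.
    while len(history) > MAX_MESSAGES:
        del history[1]
    response = client.chat.completions.create(model="gpt-4", messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```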

Is it all about just managing the context?

Of course not, but managing the context is one of the most important things. There are multiple ways of doing this; one that is often cited is fine-tuning the model. Fine-tuning in the context of machine learning refers to the process of adapting a pre-trained model for specific tasks or use cases. Let me break it down for you: pre-trained models (such as ChatGPT’s GPT-4) are models that have already been trained on a large amount of data. Fine-tuning leverages the knowledge a pre-trained model has already acquired as a starting point for learning new tasks. Instead of training a new model from scratch, we build upon the existing knowledge by giving the model additional training (very similar to the original training), but on the specific context of the tasks we need it to perform. It is much easier and more cost-effective to refine a pre-trained base model than to train a new one from scratch for a particular use case.
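As a hedged illustration, kicking off such a job with OpenAI’s fine-tuning API looks roughly like the sketch below; the training file is a hypothetical JSONL of example conversations from your own domain:

```python
# Sketch: fine-tuning a pre-trained model on your own example conversations.
# "company_chat_examples.jsonl" is a hypothetical file of chat-format examples.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("company_chat_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # a base model that supports fine-tuning
)
print(job.id)  # once finished, the fine-tuned model is used like any other model
```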

In a real-world situation, while this will let your AI assistant achieve your task, whenever the context changes you will need to fine-tune the model again. For example, let’s assume we have a model fine-tuned on your particular company data. It will be useful, no doubt, but as new knowledge is acquired and the situation changes, your company data will change as well, and at some point your model will no longer match reality. Remember how ChatGPT was initially limited to knowledge up to September 2021? And although fine-tuning is faster and cheaper than training a model from scratch, it is still model training, which is a cumbersome and costly process.

A far easier approach is what has colloquially become known as prompt engineering. Prompt engineering refers to a number of different techniques for managing the context window of an LLM. In the chat application implementation described above, prompt engineering covers the fact that the whole history of the conversation is sent to the model every time a new message is required. The prompt (essentially the query provided to the LLM) can include instructions on what the answer should look like, but it can also shape the actual context of the conversation. A typical example from the software engineering world is a developer who pastes the source code files and the error report they’re trying to fix into ChatGPT before asking it to help resolve the issue. Think of it as a mini fine-tuning for the current chat session only.
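In code, that “mini fine-tuning for the session” is often nothing more than concatenating the relevant material into the prompt. A minimal sketch (the file names and error text are made up):

```python
# Sketch: stuffing task-specific context (source files and an error report)
# into a single prompt before asking the model to fix the issue.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Hypothetical files and error a developer might paste into the conversation.
source_files = ["app/payments.py", "app/models.py"]
error_report = "TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'"

context_parts = [f"### {path}\n{Path(path).read_text()}" for path in source_files]

prompt = (
    "Here is the relevant source code:\n\n"
    + "\n\n".join(context_parts)
    + f"\n\nRunning the app produces this error:\n{error_report}\n\n"
    + "Explain the cause and propose a fix."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```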

So that means every time you want the model to do something useful, you just feed it all your specific data, right? That would be easy, wouldn’t it? However, models have a limit on the amount of text (measured in tokens, chunks of roughly a word or part of a word) they can handle in a single prompt. And you probably have a lot of company information, multiple files, and web pages you need to feed in, not to mention the actual instruction on what you want the model to do, and perhaps a whole history of preceding conversation. This is why many companies are trying to make the context windows of their models larger (Google Gemini’s recent 1-million-token version is an example).
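You can see the limit in practice by counting tokens before sending a prompt. The sketch below uses the tiktoken library and an assumed 8,000-token budget:

```python
# Sketch: checking whether a prompt fits into an assumed 8,000-token context window.
import tiktoken

CONTEXT_LIMIT = 8_000  # assumed budget; real limits depend on the model

encoding = tiktoken.encoding_for_model("gpt-4")

def fits_in_context(prompt: str) -> bool:
    n_tokens = len(encoding.encode(prompt))
    print(f"Prompt uses {n_tokens} of {CONTEXT_LIMIT} tokens")
    return n_tokens <= CONTEXT_LIMIT

fits_in_context("Summarise the attached company handbook ...")
```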

But no matter how large your model’s context window is, the AI assistant you need requires a lot of context information. As tasks become more complex and specific, the context grows larger, and sooner or later you will run out of context window space. That’s why a lot of prompt engineering techniques are actually techniques for summarizing the context. But is there a way to reverse the trend: instead of constantly expanding the context and feeding more and more data into each prompt, have the model itself extract the context it needs to complete its task?
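One common summarizing trick, sketched under the same assumptions as the earlier snippets, is to compress the older part of a conversation into a short summary and keep only the most recent messages verbatim:

```python
# Sketch: reclaiming context space by summarising older messages and keeping
# only the most recent turns word-for-word.
from openai import OpenAI

client = OpenAI()
KEEP_RECENT = 6  # how many recent messages to keep verbatim (arbitrary choice)

def compress_history(history: list[dict]) -> list[dict]:
    if len(history) <= KEEP_RECENT:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Summarise this conversation in a few sentences, "
                       "keeping all facts and decisions:\n\n" + transcript,
        }],
    ).choices[0].message.content
    # Replace the older messages with a single compact summary message.
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```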

Enter RAG – Retrieval Augmented Generation

RAG, a.k.a. Retrieval Augmented Generation, is an architectural technique that links an LLM to an information retrieval mechanism. Think of it like teaching an LLM to use Google Search. This means that every time the LLM is prompted, it can generate a small query to its own internal search engine, extract the specific part of your company’s data it needs, and add it to the initial prompt so it can generate a better response. This is, for instance, how Bing’s AI search is implemented.
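A minimal sketch of that retrieval step, where a plain Python list of hypothetical documents stands in for a real vector database:

```python
# Sketch of the RAG pattern: embed the question, find the most similar company
# documents by cosine similarity, and prepend them to the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

# Hypothetical pre-embedded company data (a real system would use a vector database).
company_docs = [
    {"text": "Refunds are processed within 14 days of the request."},
    {"text": "Enterprise customers get a dedicated support channel."},
]
for doc in company_docs:
    doc["vector"] = embed(doc["text"])

def answer(question: str, top_k: int = 1) -> str:
    q = embed(question)
    scored = sorted(
        company_docs,
        key=lambda d: float(np.dot(q, d["vector"]) / (np.linalg.norm(q) * np.linalg.norm(d["vector"]))),
        reverse=True,
    )
    context = "\n".join(d["text"] for d in scored[:top_k])
    prompt = f"Answer using only this company information:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

print(answer("How long do refunds take?"))
```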

As such, RAG is a popular technique for managing the context of LLMs. But RAG has its own limitations: it requires that you build a database of precompiled company data. This means you have a problem similar to the fine-tuning scenario, where every time your company data changes, you need to recompile that database. Of course, updating a database is much easier and cheaper than retraining a model (even if it’s only fine-tuning). So how can you have an AI that is useful for your specific use case?
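To make the “cheaper to update” point concrete: when a document changes, you only re-embed that one document and overwrite its entry in the index, with no model training involved. A rough sketch, reusing the hypothetical `company_docs` store and `embed` helper from the previous snippet:

```python
# Sketch: updating the retrieval index when a single document changes.
# Only the changed document is re-embedded; the model itself is untouched.
def upsert_document(doc_id: int, new_text: str) -> None:
    company_docs[doc_id] = {"text": new_text, "vector": embed(new_text)}

upsert_document(0, "Refunds are processed within 7 days of the request.")
```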

With a specific context. Each time you ask your AI assistant to do something, you need it to orient itself and have a very good understanding of the current state of the real world (your document space), and you need it to have very good instructions on what the final result should look like. And since LLMs are non-deterministic, meaning you cannot guarantee what the output will be, you need a feedback mechanism to ensure the output is correct. In the all-too-familiar ChatGPT use case, the feedback mechanism is you, the user. Each time the LLM is wrong or not specific enough for your needs, you correct it by giving it feedback in yet another message. But this back and forth between you and the LLM can also be automated by having two LLMs talk to each other: one plays the role of the prompter and the other the role of the promptee. One attempts to retrieve the information from your internal RAG database and formulate the response, and the other verifies that the response matches the needs of the initial prompt. This internal back-and-forth conversation repeats as many times as needed, until the precise, useful response you require is achieved.
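A rough sketch of that prompter/promptee loop, assuming the `answer()` RAG helper and `client` from the earlier sketch act as the generator, with a second model call acting as the verifier:

```python
# Sketch: two model roles talking to each other. The generator drafts an answer
# from the RAG store, the verifier checks it against the original request, and
# the loop repeats until the verifier approves or a retry limit is reached.
MAX_ROUNDS = 3  # arbitrary cap on the internal back-and-forth

def verified_answer(request: str) -> str:
    feedback = ""
    for _ in range(MAX_ROUNDS):
        draft = answer(request + (f"\n\nReviewer feedback to address: {feedback}" if feedback else ""))
        verdict = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Request: {request}\n\nDraft answer: {draft}\n\n"
                           "Reply APPROVED if the draft fully satisfies the request, "
                           "otherwise describe what is missing.",
            }],
        ).choices[0].message.content
        if verdict.strip().startswith("APPROVED"):
            return draft
        feedback = verdict  # feed the critique back into the next generation round
    return draft  # best effort after the retry limit
```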

These are the techniques we at Tessier have specialized in, so that we can build custom role-based AI assistants for software development companies. Using such techniques, we are able to build AI assistants that can work on large enterprise projects and generate whole PRs, validate and write large requirements documents, create, read and verify use cases in Jira, or perform automated tests in your CI/CD pipelines.

