In this article, we’ll explore the use of prompt compression techniques in the early stages of development, which can help reduce the ongoing operating costs of GenAI-based applications.
Often, generative AI applications utilize the retrieval-augmented generation framework, alongside prompt engineering, to extract the best output from the underlying large language models. However, this approach may not be cost-effective in the long run, as operating costs can significantly increase when your application scales in production and relies on model providers like OpenAI or Google Gemini, among others.
The prompt compression techniques we’ll explore below can significantly lower operating costs.
Challenges Faced While Building a RAG-Based GenAI App
RAG (retrieval-augmented generation) is a popular framework for building GenAI applications backed by a vector database: semantically relevant data is retrieved and added to the large language model's context window so the model can generate grounded content.
While building our GenAI application, we encountered an unexpected issue: costs rose sharply once we put the app into production and end users started using it.
After thorough inspection, we found this was mainly due to the amount of data we needed to send to OpenAI for each user interaction. The more information or context we provided so the large language model could understand the conversation, the higher the expense.
This problem was especially apparent in our Q&A chat feature, which we integrated with OpenAI. To keep the conversation flowing naturally, we had to include the entire chat history in every new query.
As you may know, a large language model has no memory of its own, so if we didn't resend all the previous conversation details, it couldn't make sense of new questions in light of past discussion. This meant that, as users kept chatting, each message sent with the full history increased our costs significantly. Though the application was quite successful and delivered a great user experience, its operating costs were too high to sustain.
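To make the cost mechanism concrete, here's a minimal sketch of the pattern, assuming OpenAI's Python SDK; the model name and message contents are placeholders rather than our production code.

```python
# pip install openai -- a minimal sketch of resending the full chat history each turn;
# the model name and messages are placeholders, not production code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a helpful Q&A assistant."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    # Every call sends the ENTIRE history, so input tokens (and cost) grow with each turn.
    response = client.chat.completions.create(model="gpt-4o", messages=history)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```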
A similar example can be found in applications that generate personalized content based on user inputs. Suppose a fitness app uses GenAI to create custom workout plans. If the app needs to consider a user’s entire exercise history, preferences, and feedback each time it suggests a new workout, the input size becomes quite large. This large input size, in turn, means higher costs for processing.
Another scenario could involve a recipe recommendation engine. If the engine tries to consider a user’s dietary restrictions, past likes and dislikes, and nutritional goals with each recommendation, the amount of information sent for processing grows. As with the chat application, this larger input size translates into higher operational costs.
In each of these examples, the key challenge is balancing the need to provide enough context for the LLM to be useful and personalized, without letting the costs spiral out of control due to the large amount of data being processed for each interaction.
How We Solved the Rising Cost of the RAG Pipeline
In facing the challenge of rising operational costs associated with our GenAI applications, we zeroed in on optimizing our communication with the AI models through a strategy known as “prompt engineering”.
Prompt engineering is a crucial technique that involves crafting our queries or instructions to the underlying LLM in such a way that we get the most precise and relevant responses. The goal is to enhance the model’s output quality while simultaneously reducing the operational expenses involved. It’s about asking the right questions in the right way, ensuring the LLM can perform efficiently and cost-effectively.
In our efforts to mitigate these costs, we explored a variety of innovative approaches within the area of prompt engineering, aiming to add value while keeping expenses manageable.
Our exploration helped us to discover the efficacy of the prompt compression technique. This approach streamlines the communication process by distilling our prompts down to their most essential elements, stripping away any unnecessary information.
This not only reduces the computational burden on the GenAI system, but also significantly lowers the cost of deploying GenAI solutions — particularly those reliant on retrieval-augmented generation technologies.
By implementing the prompt compression technique, we’ve been able to achieve considerable savings in the operational costs of our GenAI projects. This breakthrough has made it feasible to leverage these advanced technologies across a broader spectrum of business applications without the financial strain previously associated with them.
Our journey through refining prompt engineering practices underscores the importance of efficiency in GenAI interactions, proving that strategic simplification can lead to more accessible and economically viable GenAI solutions for businesses.
We used these tools not only to reduce operating costs, but also to revamp the prompts we sent to the LLM. Using the tools alone, we saw close to 51% cost savings. But when we also followed GPT's own prompt compression technique, rewriting the prompts ourselves or asking GPT to suggest shorter versions, we saw a 70-75% cost reduction.
We used OpenAI's tokenizer tool to experiment with the prompts and identify how far we could shorten them while still getting the exact same output from OpenAI. The tokenizer tool helps you calculate the exact number of tokens the LLM will consume as part of the context window.
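The web-based tokenizer has a programmatic counterpart in the tiktoken library. Here's a minimal sketch that uses it to compare token counts for the Italy trip example shown in the next section; the model name passed to encoding_for_model is simply whichever model you actually call.

```python
# pip install tiktoken -- a minimal sketch for counting tokens before and after compression
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")  # use the model your app actually calls

original = (
    "I am currently planning a trip to Italy and I want to make sure that I visit "
    "all the must-see historical sites as well as enjoy some local cuisine. Could you "
    "provide me with a list of top historical sites in Italy and some traditional "
    "dishes I should try while I am there?"
)
compressed = "Italy trip: List top historical sites and traditional dishes to try."

print(len(encoding.encode(original)))    # token count of the original prompt
print(len(encoding.encode(compressed)))  # token count of the compressed prompt
```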
Prompt examples
Let’s look at some examples of these prompts.
- Trip to Italy
Original prompt:
I am currently planning a trip to Italy and I want to make sure that I visit all the must-see historical sites as well as enjoy some local cuisine. Could you provide me with a list of top historical sites in Italy and some traditional dishes I should try while I am there?
Compressed prompt:
Italy trip: List top historical sites and traditional dishes to try.
- Healthy recipe
Original prompt:
I am looking for a healthy recipe that I can make for dinner tonight. It needs to be vegetarian, include ingredients like tomatoes, spinach, and chickpeas, and it should be something that can be made in less than an hour. Do you have any suggestions?
Compressed prompt:
Need a quick, healthy vegetarian recipe with tomatoes, spinach, and chickpeas. Suggestions?
Understanding Prompt Compression
It’s crucial to craft effective prompts for utilizing large language models in real-world enterprise applications.
Strategies like providing step-by-step reasoning, incorporating relevant examples, and including supplementary documents or conversation history play a vital role in improving model performance for specialized NLP tasks.
However, these techniques often produce longer prompts, with inputs that can span thousands of tokens, which inflates the input context window.
This substantial increase in prompt length can significantly drive up the costs associated with employing advanced models, particularly expensive LLMs like GPT-4. This is why prompt engineering must integrate other techniques to balance between providing comprehensive context and minimizing computational expense.
Prompt compression is a technique used to optimize the way we use prompt engineering and the input context to interact with large language models.
When we provide prompts or queries to an LLM, along with any relevant contextual input content, it processes the entire input, which can be computationally expensive, especially for longer prompts with lots of data. Prompt compression aims to reduce the size of the input by condensing the prompt to its most essential components, removing any unnecessary or redundant information so that the input stays within the context window limit.
The overall process of prompt compression typically involves analyzing the prompt and identifying the key elements that are crucial for the LLM to understand the context and generate a relevant response. These key elements could be specific keywords, entities, or phrases that capture the core meaning of the prompt. The compressed prompt is then created by retaining these essential components and discarding the rest.
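As a deliberately naive illustration of that retain-the-essentials idea (the tools discussed later use trained models rather than word lists), here's a toy sketch that simply drops common filler words; the FILLER set and naive_compress function are made up for this example only.

```python
# A deliberately naive illustration of "retain essentials, discard the rest".
# Real compressors (LLMLingua, Selective Context) use trained models; this toy
# version just drops common filler words to show the principle.
FILLER = {
    "i", "am", "a", "an", "the", "to", "of", "that", "could", "you", "me", "with",
    "some", "should", "while", "there", "and", "as", "well", "make", "sure", "want",
}

def naive_compress(prompt: str) -> str:
    kept = [w for w in prompt.split() if w.lower().strip(",.?") not in FILLER]
    return " ".join(kept)

print(naive_compress(
    "Could you provide me with a list of top historical sites in Italy?"
))
# -> "provide list top historical sites in Italy?"
```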
Implementing prompt compression in the RAG pipeline has several benefits:
- Reduced computational load. By compressing the prompts, the LLM needs to process less input data, resulting in a reduced computational load. This can lead to faster response times and lower computational costs.
- Improved cost-effectiveness. Most LLM providers charge based on the number of tokens (words or subwords) passed as part of the input context window and processed. By using compressed prompts, the number of tokens is greatly reduced, leading to significantly lower costs for each query or interaction with the LLM.
- Increased efficiency. Shorter and more concise prompts can help the LLM focus on the most relevant information, potentially improving the quality and accuracy of the generated responses and the output.
- Scalability. Prompt compression can result in improved performance, as the irrelevant words are ignored, making it easier to scale GenAI applications.
While prompt compression offers numerous benefits, it also presents some challenges that engineering teams should consider when building generative AI-based applications:
- Potential loss of context. Compressing prompts too aggressively may lead to a loss of important context, which can negatively impact the quality of the LLM’s responses.
- Complexity of the task. Some tasks or prompts may be inherently complex, making it challenging to identify and retain the essential components without losing critical information.
- Domain-specific knowledge. Effective prompt compression requires domain-specific knowledge or expertise of the engineering team to accurately identify the most important elements of a prompt.
- Trade-off between compression and performance. Finding the right balance between the amount of compression and the desired performance can be a delicate process and might require careful tuning and experimentation.
To address these challenges, it’s important to develop robust prompt compression strategies customized to specific use cases, domains, and LLM models. It also requires continuous monitoring and evaluation of the compressed prompts and the LLM’s responses to ensure the desired level of performance and cost-effectiveness are being achieved.
Tools and Libraries We Used for Prompt Compression
Microsoft LLMLingua
Microsoft LLMLingua is a state-of-the-art toolkit designed to optimize and enhance the output of large language models, including those used for natural language processing tasks.
The primary purpose of LLMLingua is to provide developers and researchers with advanced tools to improve the efficiency and effectiveness of LLMs, particularly in generating more precise and concise text outputs. It focuses on the refinement and compression of prompts and makes interactions with LLMs more streamlined and productive, enabling the creation of more effective prompts without sacrificing the quality or intent of the original text.
LLMLingua offers a variety of features and capabilities to increase the performance of LLMs. One of its key strengths lies in its sophisticated algorithms for prompt compression, which intelligently reduce the length of input prompts while retaining their essential meaning. This is particularly beneficial for applications where token limits or processing efficiency are concerns.
LLMLingua also includes tools for prompt optimization, which help refine prompts to elicit better responses from LLMs. The framework supports multiple languages as well, making it a versatile tool for global applications.
These capabilities make LLMLingua an invaluable asset for developers seeking to enhance the interaction between users and LLMs, ensuring that prompts are both efficient and effective.
LLMLingua can be integrated with LLMs for prompt compression by following a few straightforward steps.
First, ensure that you have LLMLingua installed and configured in your development environment. This typically involves downloading the LLMLingua package and including it in your project's dependencies. LLMLingua employs a compact, highly trained language model (such as GPT2-small or LLaMA-7B) to identify and remove non-essential words or tokens from prompts. This approach enables efficient processing with large language models, achieving up to 20 times compression with minimal loss in performance quality.
Once installed, you can begin by inputting your original prompt into LLMLingua’s compression tool. The tool then processes the prompt, applying its algorithms to condense the input text while maintaining its core message.
After the compression process, LLMLingua outputs a shorter, optimized version of the prompt. This compressed prompt can then be used as input for your LLM, potentially leading to faster processing times and more focused responses.
Throughout this process, LLMLingua provides options to customize the compression level and other parameters, allowing developers to fine-tune the balance between prompt length and information retention according to their specific needs.
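To ground those steps, here's a minimal sketch based on the llmlingua Python package's documented usage; the constructor defaults, argument names, and return keys may differ between versions, and the context, instruction, and question strings are placeholders.

```python
# pip install llmlingua -- a minimal sketch; API details may vary by version
from llmlingua import PromptCompressor

# The compressor loads a small language model to score and prune tokens.
compressor = PromptCompressor()

retrieved_context = [
    "Document 1: long RAG context retrieved from the vector database ...",
    "Document 2: more supporting content ...",
]

result = compressor.compress_prompt(
    retrieved_context,                       # context to compress
    instruction="Answer using the context provided.",
    question="What historical sites should I visit in Italy?",
    target_token=200,                        # rough budget for the compressed context
)

print(result["compressed_prompt"])           # compressed text to send to the LLM
print(result["origin_tokens"], "->", result["compressed_tokens"])
```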
Selective Context
Selective Context is a cutting-edge framework designed to address the challenges of prompt compression in the context of large language models.
By focusing on the selective inclusion of context, it helps to refine and optimize prompts. This ensures that they are both concise and rich in the necessary information for effective model interaction.
This approach allows for the efficient processing of inputs by LLMs. This makes Selective Context a valuable tool for developers and researchers looking to enhance the quality and efficiency of their NLP applications.
The core capability of Selective Context lies in its ability to improve the quality of prompts for the LLMs. It does so by integrating advanced algorithms that analyze the content of a prompt to determine which parts are most relevant and informative for the task at hand.
By retaining only the essential information, Selective Context provides streamlined prompts that can significantly enhance the performance of LLMs. This not only leads to more accurate and relevant responses from the models but also contributes to faster processing times and reduced computational resource usage.
Integrating Selective Context into your workflow involves a few practical steps:
- Initially, users need to familiarize themselves with the framework, which is available on GitHub, and incorporate it into their development environment.
- Next, the process begins with the preparation of the original, uncompressed prompt, which is then inputted into Selective Context.
- The framework evaluates the prompt, identifying and retaining key pieces of information while eliminating unnecessary content. This results in a compressed version of the prompt that’s optimized for use with LLMs.
- Users can then feed this refined prompt into their chosen LLM, benefiting from improved interaction quality and efficiency.
Throughout this process, Selective Context offers customizable settings, allowing users to adjust the compression and selection criteria based on their specific needs and the characteristics of their LLMs.
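For reference, here's a minimal sketch based on the Selective Context project's README; the package name, the model_type and reduce_ratio parameters, and the return values are taken from that documentation and may vary between versions.

```python
# pip install selective-context -- a minimal sketch; API details may vary by version
from selective_context import SelectiveContext

# A small causal language model (GPT-2 here) scores how informative each phrase is.
sc = SelectiveContext(model_type="gpt2", lang="en")

long_prompt = (
    "I am writing a story about a young girl who discovers she has magical powers "
    "on her thirteenth birthday. The story is set in a small village in the mountains, "
    "and she has to learn how to control her powers while keeping them a secret."
)

# reduce_ratio controls roughly how much of the original content gets removed.
compressed_prompt, removed_content = sc(long_prompt, reduce_ratio=0.35)

print(compressed_prompt)  # the retained, information-dense portion of the prompt
```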
Prompt Compression in OpenAI’s GPT Models
Prompt compression in OpenAI’s GPT models is a technique designed to streamline the input prompt without losing the critical information required for the model to understand and respond accurately. This is particularly useful in scenarios where token limitations are a concern or when seeking more efficient processing.
Techniques range from manual summarization to employing specialized tools that automate the process, such as Selective Context, which evaluates and retains essential content.
For example, take an initial detailed prompt like this:
Discuss in depth the impact of the industrial revolution on European socio-economic structures, focusing on changes in labor, technology, and urbanization.
This can be compressed to this:
Explain the industrial revolution’s impact on Europe, including labor, technology, and urbanization.
This shorter, more direct prompt still conveys the critical aspects of the inquiry, but in a more succinct manner, potentially leading to faster and more focused model responses.
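Besides rewriting prompts by hand, you can ask the model itself to do the shortening, as mentioned earlier. Here's a minimal sketch of that idea using OpenAI's Python SDK; the gpt-4o-mini model name and the system instruction wording are assumptions, so substitute whatever chat-capable model and phrasing you prefer.

```python
# pip install openai -- a minimal sketch of asking GPT to compress a prompt;
# the model name and instruction wording are assumptions, not a fixed recipe.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

original_prompt = (
    "Discuss in depth the impact of the industrial revolution on European socio-economic "
    "structures, focusing on changes in labor, technology, and urbanization."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Rewrite the user's prompt as briefly as possible while preserving every requirement.",
        },
        {"role": "user", "content": original_prompt},
    ],
)

print(response.choices[0].message.content)  # a shortened prompt to review and reuse
```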
Here are some more examples of prompt compression:
- Hamlet analysis
Original prompt:
Could you provide a comprehensive analysis of Shakespeare’s ‘Hamlet,’ including themes, character development, and its significance in English literature?
Compressed prompt:
Analyze ‘Hamlet’s’ themes, character development, and significance.
- Photosynthesis
Original prompt:
I’m interested in understanding the process of photosynthesis, including how plants convert light energy into chemical energy, the role of chlorophyll, and the overall impact on the ecosystem.
Compressed prompt:
Summarize photosynthesis, focusing on light conversion, chlorophyll’s role, and ecosystem impact.
- Story suggestions
Original prompt:
I am writing a story about a young girl who discovers she has magical powers on her thirteenth birthday. The story is set in a small village in the mountains, and she has to learn how to control her powers while keeping them a secret from her family and friends. Can you help me come up with some ideas for challenges she might face, both in learning to control her powers and in keeping them hidden?
Compressed prompt:
Story ideas needed: A girl discovers magic at 13 in a mountain village. Challenges in controlling and hiding powers?
These examples showcase how reducing the length and complexity of prompts can still retain the essential request, leading to efficient and focused responses from GPT models.
Conclusion
Incorporating prompt compression into enterprise applications can significantly enhance the efficiency and effectiveness of LLM applications.
Combining Microsoft LLMLingua and Selective Context provides a complementary approach to prompt optimization. LLMLingua can be leveraged for its advanced linguistic analysis capabilities to refine and simplify inputs, while Selective Context’s focus on content relevance ensures that essential information is retained, even in a compressed format.
When selecting the right tool, consider the specific needs of your LLM application. LLMLingua excels in environments where linguistic precision is crucial, whereas Selective Context is ideal for applications that require content prioritization.
Prompt compression is key to improving interactions with LLMs, making them more efficient and producing better results. By using tools like Microsoft LLMLingua and Selective Context, we can fine-tune AI prompts for various needs.
If we use OpenAI’s models, then besides integrating the tools and libraries above, we can also use the simple prompt-shortening technique described earlier. This creates cost-saving opportunities and improves the performance of RAG-based GenAI applications.
Suvoraj is a Subject Matter Expert and thought leader in Enterprise Generative AI and its Architecture & Governance, and the author of Enterprise Generative AI Architecture - An Architect’s Guide to adopting Generative AI at Enterprise, which has received accolades from renowned industry veterans globally in this new cutting-edge domain. Professionally, he works as a Solution Architect (Enterprise) at a prominent Fortune 500 financial corporation. He was chosen as an AWS Community Builder by AWS in 2024. You can learn more on his website.