Saturday, August 10, 2024

Salmon Run: Experiments with Prompt Compression


I recently came across Prompt Compression (in the context of Prompt Engineering with Large Language Models) in this short course on Prompt Compression and Query Optimization from DeepLearning.AI. Essentially, it involves compressing the prompt text using a trained model to drop non-essential tokens. The resulting prompt is shorter (and, in cases where the original context is longer than the LLM's context limit, not truncated) but retains the original semantic meaning. Because it is short, the LLM can process it faster and more cheaply, and in some cases get around the Lost In The Middle problems observed with long contexts.

The course demonstrated Prompt Compression using the LLMLingua library (paper) from Microsoft. I had heard about LLMLingua previously from my ex-colleague Raahul Dutta, who blogged about it in his Edition 26: LLMLingua - A Zip Technique for Prompt post, but at the time I thought maybe it was more in the realm of research. Seeing it mentioned in the DeepLearning.AI course made it feel more mainstream, so I tried it out on a single query from my domain using their Quick Start example, compressing the prompt with the small llmlingua-2-bert-base-multilingual-cased-meetingbank model, and using Anthropic's Claude-v2 on AWS Bedrock as the LLM.
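For reference, here is a minimal sketch of how I would call Claude-v2 on Bedrock with a (possibly compressed) prompt. It uses the standard Bedrock runtime text-completion request for Claude v2; the helper name and parameter values are my own, not from the course.

```python
import json
import boto3

# minimal sketch: send a prompt to Claude-v2 on AWS Bedrock and return the completion
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask_claude(prompt: str, max_tokens: int = 1024) -> str:
    body = json.dumps({
        "prompt": prompt,                      # must follow the "Human: ... Assistant:" format
        "max_tokens_to_sample": max_tokens,
        "temperature": 0.0,
    })
    resp = bedrock.invoke_model(modelId="anthropic.claude-v2", body=body)
    return json.loads(resp["body"].read())["completion"]
```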

Compressing the prompt for the single query gave me a better answer than without compression, at least going by inspecting the answer produced by the LLM before and after compression. Encouraged by these results, I decided to evaluate the technique using a set of around 50 queries I had lying around (together with a vector search index) from a previous project. This post describes the evaluation process and the results I obtained from it.

My baseline was a naive RAG pipeline, with the context retrieved by vector matching the query against the corpus, and then incorporated into a prompt that looks like this (a sketch of the retrieval and prompt assembly follows the template below). The index is an OpenSearch index containing vectors of document chunks, vectorization was done using the all-MiniLM-L6-v2 pre-trained SentenceTransformers encoder, and the LLM is Claude-2 (on AWS Bedrock as mentioned previously).

Human: You are a medical expert tasked with answering questions
expressed as short phrases. Given the following CONTEXT, answer the QUESTION.

CONTEXT:
{context}

QUESTION: {query}

Assistant:

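Here is a minimal sketch of the retrieval and prompt assembly step, assuming an OpenSearch k-NN index; the index and field names are placeholders and not the ones from my project.

```python
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

# same encoder used for both indexing and querying
encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

PROMPT_TEMPLATE = """Human: You are a medical expert tasked with answering questions
expressed as short phrases. Given the following CONTEXT, answer the QUESTION.

CONTEXT:
{context}

QUESTION: {query}

Assistant:"""

def retrieve_contexts(query: str, top_k: int = 10) -> list[str]:
    # k-NN match of the query vector against the document chunk vectors
    qvec = encoder.encode(query).tolist()
    resp = client.search(
        index="doc-chunks",                                    # placeholder index name
        body={"size": top_k,
              "query": {"knn": {"chunk_vector": {"vector": qvec, "k": top_k}}}})
    return [hit["_source"]["chunk_text"] for hit in resp["hits"]["hits"]]

def build_prompt(query: str, contexts: list[str]) -> str:
    return PROMPT_TEMPLATE.format(context="\n\n".join(contexts), query=query)
```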
While the structure of the prompt is fairly standard, LLMLingua explicitly requires the prompt to be composed of an instruction (the System prompt beginning with Human:), the demonstration (the {context}) and the question (the actual query to the RAG pipeline). The LLMLingua PromptCompressor's compress_prompt function expects these to be passed in separately as parameters. Presumably, it compresses the demonstration with respect to the instruction and the question, i.e. context tokens that are non-essential given the instruction and question are dropped during the compression process.

The baseline for the experiment uses the context as retrieved from the vector store without compression, and we evaluate the results of prompt compression using the two models listed in LLMLingua's Quick Start — llmlingua-2-bert-base-multilingual-cased-meetingbank (small model) and llmlingua-2-xlm-roberta-large-meetingbank (large model). The three pipelines — baseline, compression using the small model, and compression using the large model — are run against my 50 query dataset. The examples imply that the compressed prompt can be provided as-is to the LLM, but I found that (at least with the small model) the resulting compressed prompt generates answers that do not always capture all of the question's nuance. So I ended up substituting only the {context} part of the prompt with the generated compressed prompt in my experiments, as sketched below.
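Concretely, the substitution looks roughly like the following sketch, reusing the hypothetical helpers from the earlier sketches; compress_contexts stands in for the compress_prompt call shown later in this post.

```python
# hypothetical glue code: swap the raw retrieved context for the compressed one,
# but keep the rest of the prompt template (instruction and question) unchanged
def answer_with_compression(query: str) -> str:
    contexts = retrieve_contexts(query)                        # 10 raw chunks from OpenSearch
    compressed_context = compress_contexts(contexts, query)    # see compress_prompt call below
    prompt = PROMPT_TEMPLATE.format(context=compressed_context, query=query)
    return ask_claude(prompt)
```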

Our evaluation metric is Answer Relevance as defined by the RAGAS project. It is a measure of how relevant the generated answer is given the question. To calculate this, we prompt the LLM to generate a number of (in our case, up to 10) questions from the generated answer. We then compute the cosine similarity of the vector of each generated question with the vector of the actual question. The average of these cosine similarities is the Answer Relevance. Question generation from the answer is done by prompting Claude-2, and vectorization of the original and generated questions is done using the same SentenceTransformer encoder we used for retrieval.
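A minimal sketch of this metric, again reusing the ask_claude and encoder helpers from the earlier sketches; the question-generation prompt and the line-by-line parsing here are my own simplifications, not RAGAS code.

```python
import numpy as np

def answer_relevance(question: str, answer: str, num_questions: int = 10) -> float:
    # ask the LLM to generate questions that the generated answer would answer
    gen_prompt = (f"Human: Generate up to {num_questions} short questions that are "
                  f"answered by the following text:\n\n{answer}\n\nAssistant:")
    generated = [q.strip() for q in ask_claude(gen_prompt).split("\n") if q.strip()]
    generated = generated[:num_questions]
    # embed the original and generated questions with the same encoder used for retrieval
    q_vec = encoder.encode([question])[0]
    g_vecs = encoder.encode(generated)
    # average cosine similarity between each generated question and the original question
    sims = [np.dot(q_vec, g) / (np.linalg.norm(q_vec) * np.linalg.norm(g)) for g in g_vecs]
    return float(np.mean(sims))
```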

Contrary to what I saw in my first example, the results were mixed when run against the 50 queries. Prompt Compression does result in faster response times, but it degraded the Answer Relevance scores more often than it improved them. This is true for both the small and large compression models. Here are plots of the difference in Answer Relevance score for the compressed prompt against the baseline uncompressed prompt, for each compression model. The vertical red line separates the cases where compression hurts answer relevance (left side) from those where it improves answer relevance (right side). In general, it seems like compression helps when the input prompt is longer, which intuitively makes sense. But there does not seem to be a simple way to know up front whether prompt compression is going to help or hurt.
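One way to produce such a plot, assuming the per-query Answer Relevance scores for the baseline and compressed pipelines have already been collected into two parallel lists (the function and variable names here are hypothetical):

```python
import matplotlib.pyplot as plt

def plot_relevance_delta(baseline_scores: list[float],
                         compressed_scores: list[float],
                         title: str) -> None:
    # per-query difference: positive means compression improved answer relevance
    deltas = [c - b for b, c in zip(baseline_scores, compressed_scores)]
    plt.hist(deltas, bins=20)
    plt.axvline(0.0, color="red")     # queries left of this line were hurt by compression
    plt.xlabel("answer relevance (compressed - baseline)")
    plt.ylabel("number of queries")
    plt.title(title)
    plt.show()
```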

I used the following parameters to instantiate LLMLingua's PromptCompressor object and to call its compress_prompt function. These are the same parameters that were shown in the Quick Start. It is possible I may have gotten different / better results if I had experimented a bit with the parameters.

from llmlingua import PromptCompressor

# model_name is one of the two LLMLingua-2 models (small or large) being evaluated
compressor = PromptCompressor(model_name=model_name, use_llmlingua2=True)

# compress the retrieved context chunks, conditioned on the instruction and the question
compressed = compressor.compress_prompt(contexts, instruction=instruction, question=question,
    target_token=500, condition_compare=True, condition_in_question="after",
    rank_method="longllmlingua", use_sentence_level_filter=False, context_budget="+100",
    dynamic_context_compression_ratio=0.4, reorder_context="sort")
compressed_context = compressed["compressed_prompt"]

A few observations about the compressed context. The number of context documents changes before and after compression. In my case, all input contexts had 10 chunks, and the output would vary between 3-5 chunks, which probably leads to the elimination of Lost in the Middle side-effects as claimed in LLMLingua's documentation. Also, the resulting context chunks are shorter and look like strings of keywords rather than coherent sentences, basically unintelligible to human readers, but intelligible to the LLM.
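These observations can be checked by inspecting the dictionary returned by compress_prompt; a small sketch, assuming the compressed variable from the call above and that the context chunks remain separated by blank lines (both the token-count keys and the chunk-splitting heuristic are assumptions on my part):

```python
# look at the (keyword-like) compressed text and the reported token counts
print(compressed["compressed_prompt"])
print("origin tokens:", compressed["origin_tokens"],
      "compressed tokens:", compressed["compressed_tokens"],
      "ratio:", compressed["ratio"])

# the compressed context typically retains fewer chunks than the 10 that were retrieved
chunks = [c for c in compressed["compressed_prompt"].split("\n\n") if c.strip()]
print("number of chunks after compression:", len(chunks))
```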

Overall, Prompt Compression seems like an interesting and very powerful technique that can result in savings of time and money if used judiciously. Their paper shows very impressive results on some standard benchmark datasets with supervised learning style metrics, using a variety of compression ratios. I used Answer Relevance because it can be computed without needing domain experts to grade the answers. But it is likely that I am missing some important optimization, so I am curious if any of you have tried it, and whether your results are different from mine. If so, I would appreciate any pointers to things you think I might be missing.
