Fine-tuning techniques: The term “fine-tuning” refers to further training a pretrained model. In the case of LLMs, this means that we take a pretrained foundation model and train it some more. However, there are so many different ways that this training can be done, which makes the concept of fine-tuning incredibly vague. This single term can refer to a variety of different techniques, such as:
- Continued pretraining
- Instruction tuning
- Supervised fine-tuning (SFT); a minimal code sketch of SFT follows this list
- Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO)
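To make these terms concrete, here is a minimal sketch of the most common variant, supervised fine-tuning (SFT), using the Hugging Face transformers and datasets libraries. The model name, prompt template, dataset, and hyperparameters are illustrative assumptions, not a recommended recipe.

```python
# Minimal SFT sketch: further train a pretrained causal LM on a handful of
# instruction/response pairs. All names and settings below are illustrative.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # stand-in for any pretrained foundation model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# A toy fine-tuning dataset; real SFT datasets contain thousands of examples.
pairs = [
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
    {"instruction": "Translate to French: Hello!", "response": "Bonjour !"},
]

def format_and_tokenize(example):
    # Concatenate prompt and response into a single training sequence.
    text = (f"### Instruction:\n{example['instruction']}\n"
            f"### Response:\n{example['response']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(
    format_and_tokenize, remove_columns=["instruction", "response"]
)

# Standard next-token prediction (causal LM) loss over the formatted sequences.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-demo", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The other techniques in the list differ mainly in the data they use (raw text vs. instructions vs. preference pairs) and in the training objective, not in this basic further-training loop.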
What is the goal of these techniques? For language models, there are two primary goals that a practitioner will have when performing fine-tuning:
- Knowledge injection: Teach the model how to leverage new sources of knowledge (not present during pretraining) when solving problems.
- Alignment (or style/format specification): Modify the way in which the language model surfaces its existing knowledge base; e.g., abide by a certain answer format, use a new style/tone of voice, avoid outputting incorrect information, and more. (Illustrative training-data examples for both goals follow this list.)
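As a rough illustration of how these two goals translate into training data, the snippet below contrasts a knowledge-injection example (raw domain text, trained exactly like pretraining) with an alignment example (an instruction paired with a response in the desired format). The field names and contents are illustrative assumptions, not a standard schema.

```python
# Illustrative training examples for the two fine-tuning goals.
# Field names and contents are made up for illustration.

# Knowledge injection: raw text from a new source, trained with the ordinary
# next-token objective so the model absorbs facts it never saw in pretraining.
knowledge_injection_example = {
    "text": (
        "The 2023 revision of the internal billing API deprecates the v1 "
        "endpoints and requires an idempotency key on every POST request."
    ),
}

# Alignment / style specification: an instruction paired with a response that
# demonstrates the desired answer format and tone (no new facts required).
alignment_example = {
    "instruction": "Explain what an idempotency key is.",
    "response": (
        "Answer: An idempotency key is a unique, client-supplied token that "
        "lets a server safely ignore duplicate requests.\n"
        "Caveat: keys must be retained long enough to cover client retries."
    ),
}
```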
Given this information, we might wonder: Which fine-tuning techniques should we use to accomplish either (or both) of these goals? To answer this question, we need to take a much deeper look at recent research on the topic of fine-tuning.
Large-scale instruction tuning: Prior to the release of modern open-source LLMs, it was very common to fine-tune pretrained LLMs on massive instruction tuning datasets. Such an approach was popularized by models like FLAN [1] (from Google), which perform instruction tuning of pretrained language models over massive datasets. In the case of FLAN, for instance, the FLANv2 instruction tuning dataset contains over 15M examples! By following this approach, FLAN can learn to solve a large number of different downstream tasks in an efficient manner.
“We show that by training a model on these instructions it not only becomes good at solving the kinds of instructions it has seen during training but becomes good at following instructions in general.” – from FLAN paper [1]
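As a rough sketch of what this kind of instruction tuning data looks like, the snippet below renders a single natural language inference example through several instruction templates, in the spirit of FLAN’s task templates. The templates shown here are made up for illustration and are not the actual FLAN templates.

```python
# FLAN-style instruction tuning turns existing NLP datasets into natural
# language instructions via templates. One example, several phrasings.
example = {
    "premise": "A dog is running through the snow.",
    "hypothesis": "An animal is outside.",
    "label": "entailment",
}

# Illustrative templates (not the real FLAN templates).
templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "Read the text and answer entailment, contradiction, or neutral.\nText: {premise}\nClaim: {hypothesis}",
    'If "{premise}", can we conclude that "{hypothesis}"?',
]

# Each rendered (input, target) pair becomes one instruction tuning example;
# doing this across many tasks and templates yields millions of examples.
instruction_examples = [
    {"input": t.format(**example), "target": example["label"]} for t in templates
]

for ex in instruction_examples:
    print(ex["input"], "->", ex["target"], end="\n\n")
```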
Beyond knowledge injection: After the proposal of ChatGPT, we saw an increase in the desire to align language models and adapt their output format to a particular style or structure. Such a goal is drastically different from teaching an LLM to solve a new task. When we are trying to teach an LLM new knowledge, more data is always better (hence the large instruction tuning datasets used by models like FLAN). However, aligning the language model to a certain style or structure of output does not require learning new information! So, maybe alignment-focused goals require less extensive fine-tuning.
Less is more for alignment: Research on the topic of LLM fine-tuning was catalyzed by the release of LLaMA [2] (and later LLaMA-2 [3]), which made high-quality foundation LLMs openly available. Shortly after LLaMA, authors from Meta published LIMA [4], which showed that alignment-style fine-tuning can be accomplished with very little data. In particular, the goal of alignment is to adapt the LLM’s style (rather than to learn new information), which can be accomplished via a small, high-quality, and diverse fine-tuning dataset. Such findings revealed that the majority of an LLM’s knowledge comes from pretraining, and the LLM learns the correct style during alignment (see the quote below).
“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.” – from LIMA paper [4]
Imitating proprietary LLMs: Following LIMA, a large number of high-quality, fine-tuned LLMs (e.g., Alpaca, Vicuna, Koala, Orca, and more) were created by fine-tuning LLaMA over small synthetic datasets of GPT-3.5/4 outputs. In this way, we could train these models to imitate the output of more powerful LLMs. When evaluated in human trials and on simplistic benchmarks, these models seemed to match (or exceed) the performance of powerful models like ChatGPT. For this reason, practitioners began to believe that we could surpass models like GPT-4 or ChatGPT by performing a small amount of (inexpensive) fine-tuning.
What’s going on here? Clearly, training a model like ChatGPT can’t be achieved this easily. Researchers quickly found some limitations in the work done on imitation models [5]:
- Humans are easily tricked if the style of the LLM is good, and (as shown by LIMA) these models can quickly learn to mimic the style of models like ChatGPT with little data.
- The benchmarks that were used are too limited. The models perform well when evaluated by a small group of humans, but their performance falls apart on more extensive benchmarks that include traditional, perplexity-based evaluations (e.g., standard NLP benchmarks).
We can learn certain things (e.g., style and output format) from fine-tuning over a small amount of data, but we can’t learn everything! These imitation models lack the knowledge base of more powerful LLMs, which can only be learned from large amounts of data.
Putting everything together: Given all of the information we’ve covered so far, there are several takeaways that we can deduce:
- Most of an LLM’s knowledge comes from pretraining.
- We can perform fine-tuning in the form of continued pretraining to expose the LLM to more (and new) data/knowledge.
- Alignment-focused goals can be achieved via fine-tuning (SFT) on small, high-quality datasets. We don’t need tons of data to learn the style or format of output, only to learn new knowledge. A code sketch contrasting these two training setups follows this list.
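One way to see the mechanical difference between these two forms of fine-tuning is in how the training labels are built: continued pretraining applies the next-token loss to every token of raw text, while alignment-style SFT commonly masks the prompt tokens so that only the response is supervised. The sketch below illustrates this; the label value -100 follows the Hugging Face convention for tokens ignored by the loss, and the tokenizer choice is an arbitrary stand-in.

```python
# Sketch: label construction for knowledge injection vs. alignment-style SFT.
# Label -100 marks tokens ignored by the cross-entropy loss (Hugging Face convention).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer

def continued_pretraining_example(raw_text: str):
    # Knowledge injection: every token of the raw text is supervised,
    # exactly as during pretraining.
    ids = tokenizer(raw_text)["input_ids"]
    return {"input_ids": ids, "labels": list(ids)}

def alignment_sft_example(prompt: str, response: str):
    # Alignment: only the response tokens are supervised, so the model learns
    # how to answer (style/format) rather than memorizing the prompt.
    prompt_ids = tokenizer(prompt)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token)["input_ids"]
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [-100] * len(prompt_ids) + list(response_ids),
    }

print(continued_pretraining_example("LLaMA-2 was released in July 2023."))
print(alignment_sft_example("### Instruction:\nBe concise.\n### Response:\n", "Sure."))
```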
When performing fine-tuning, it’s important that we know which goal, either alignment or knowledge injection, we are aiming for. Then, we should put benchmarks in place that allow us to accurately and comprehensively assess whether or not that goal was accomplished. Imitation models failed to do this, which led to a bunch of misleading claims/results!
Ongoing work: The story doesn’t stop here! In fact, the distinction between pretraining and fine-tuning is still quite vague. At what point does the LLM actually start learning new knowledge instead of just learning style/alignment? Many recent publications are continuing to study this question:
- Fine-tuning vs. RAG [6]: authors find that continued pretraining is not very effective at knowledge injection, while RAG is actually highly effective at specializing an LLM to a new knowledge base (a minimal retrieval sketch follows this list).
- LIMIT [7]: authors from MosaicML/Databricks show that we can perform fine-tuning over a small mixture of instruction tuning and alignment-focused data, yielding a model that performs well on both NLP benchmarks and style-focused evaluations.
- TULU [8]: authors subject fine-tuned LLMs to broader evaluations, finding that the quality of the base model has a massive impact on performance and that no single fine-tuning dataset/strategy yields the best results across all benchmarks.
- TULU-2 [9]: authors show that fine-tuning LLMs over specific datasets leads to the model learning specific skills and domains of knowledge. Fine-tuning works well if we make sure the fine-tuning dataset is highly relevant to the style/domain of evaluation being used.
- AlpaGasus [10]: authors directly study how much fine-tuning data is necessary for an LLM to perform well on various downstream tasks.
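To make the fine-tuning vs. RAG comparison above more concrete, here is a minimal retrieval-augmented generation sketch: the new knowledge lives in a document store and is retrieved into the prompt at inference time, rather than being baked into the model’s weights by fine-tuning. The retriever here is a simple TF-IDF scorer for illustration; the documents, prompt format, and top-k choice are assumptions.

```python
# Minimal RAG sketch: retrieve relevant documents and prepend them to the
# prompt, instead of fine-tuning the model on the new knowledge base.
# Real systems typically use dense embeddings and a vector database.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [  # stand-in knowledge base (not seen during pretraining)
    "Policy 14-B: refunds are processed within 5 business days.",
    "Policy 22-A: enterprise contracts renew automatically each January.",
    "Policy 7-C: support tickets are triaged within 4 hours.",
]

vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)

def build_rag_prompt(question: str, k: int = 2) -> str:
    # Score every document against the question and keep the top-k.
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    context = "\n".join(documents[i] for i in scores.argsort()[::-1][:k])
    # The retrieved context is placed in the prompt; a frozen LLM then answers
    # from this context instead of relying on knowledge stored in its weights.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_rag_prompt("How long do refunds take?"))
```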
Bibliography:
[1] Wei, Jason, et al. “Finetuned language models are zero-shot learners.” arXiv preprint arXiv:2109.01652 (2021).
[2] Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).
[3] Touvron, Hugo, et al. “Llama 2: Open foundation and fine-tuned chat models.” arXiv preprint arXiv:2307.09288 (2023).
[4] Zhou, Chunting, et al. “Lima: Less is more for alignment.” Advances in Neural Information Processing Systems 36 (2024).
[5] Gudibande, Arnav, et al. “The false promise of imitating proprietary llms.” arXiv preprint arXiv:2305.15717 (2023).
[6] Ovadia, Oded, et al. “Fine-tuning or retrieval? comparing knowledge injection in llms.” arXiv preprint arXiv:2312.05934 (2023).
[7] Jha, Aditi, et al. “LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms.” arXiv preprint arXiv:2311.13133 (2023).
[8] Wang, Yizhong, et al. “How far can camels go? exploring the state of instruction tuning on open resources.” Advances in Neural Information Processing Systems 36 (2024).
[9] Ivison, Hamish, et al. “Camels in a changing climate: Enhancing lm adaptation with tulu 2.” arXiv preprint arXiv:2311.10702 (2023).
[10] Chen, Lichang, et al. “Alpagasus: Training a better alpaca with fewer data.” arXiv preprint arXiv:2307.08701 (2023).