Fine-tuning techniques: The term “fine tuning” refers to further training a pretrained model. In the case of LLMs, this means that we take a pretrained foundation model and train it some more. But there are many different ways that this training can be done, which makes the concept of fine tuning incredibly vague. This single term can refer to a variety of different techniques, such as:
- Continued pretraining
- Instruction tuning
- Supervised fine tuning (SFT); see the sketch after this list
- Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO)
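Since these terms often get used interchangeably, here is a minimal sketch of what the most common variant, supervised fine tuning (SFT), boils down to in code: next-token cross-entropy computed only over the response portion of an instruction/response pair. This is a toy illustration using PyTorch and Hugging Face transformers; the checkpoint, prompt format, and hyperparameters are placeholder assumptions rather than a recipe from any particular paper.

```python
# Toy sketch of one supervised fine-tuning (SFT) step: standard next-token
# cross-entropy, computed only on the response tokens so the model learns to
# produce the response given the prompt. Model name and example are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Instruction: Summarize the sentence below.\nInput: The cat sat on the mat.\nResponse: "
response = "A cat is sitting on a mat."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Mask the prompt tokens with -100 so the loss only covers the response
# (tokenization effects at the prompt/response boundary are ignored here).
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print("SFT loss on this example:", loss.item())
```

Libraries such as Hugging Face’s trl (e.g., its SFTTrainer) wrap this pattern with batching, packing, and chat templates, but the underlying objective is the same.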
What is the goal of these techniques? For language models, there are two primary goals that a practitioner will have when performing fine tuning:
- Knowledge injection: Teach the model how to leverage new sources of knowledge (not present during pretraining) when solving problems.
- Alignment (or style/format specification): Modify the way in which the language model surfaces its existing knowledge base; e.g., abide by a certain answer format, use a new style/tone of voice, avoid outputting incorrect information, and more.
Given this information, we might wonder: Which fine-tuning techniques should we use to accomplish either (or both) of these goals? To answer this question, we need to take a much deeper look at recent research on the topic of fine tuning.
Large-scale instruction tuning: Prior to the release of modern open-source LLMs, it was very common to fine tune pretrained LLMs on massive instruction tuning datasets. Such an approach was popularized by models like FLAN [1] (from Google), which perform instruction tuning of pretrained language models over large datasets. In the case of FLAN, for instance, the FLANv2 instruction tuning dataset contains over 15M examples (very large!). By following this approach, FLAN can learn to solve a large number of different downstream tasks in an efficient manner.
“We show that by training a model on these instructions it not only becomes good at solving the kinds of instructions it has seen during training but becomes good at following instructions in general.” – from FLAN paper [1]
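To give a sense of what this kind of data looks like, the sketch below templates a plain labeled example into instruction/response form, loosely in the spirit of FLAN’s task templates. The templates and fields are made up for illustration and are not FLAN’s actual templates.

```python
# Hypothetical sketch: turning a standard NLP example (here, sentiment
# classification) into instruction/response text, in the spirit of FLAN.
# The templates below are illustrative, not FLAN's actual templates.
TEMPLATES = [
    "Is the sentiment of the following review positive or negative?\n\n{text}",
    "Review: {text}\n\nWhat is the sentiment of this review?",
]

def to_instruction_example(example: dict, template: str) -> dict:
    """Map a raw {'text', 'label'} record to an instruction-tuning record."""
    return {
        "instruction": template.format(text=example["text"]),
        "response": example["label"],
    }

raw = {"text": "The movie was a complete waste of time.", "label": "negative"}
for template in TEMPLATES:
    print(to_instruction_example(raw, template))
```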
Beyond knowledge injection: After the proposal of ChatGPT, we saw an increase in the desire to align language models and adapt their output format to a particular style or structure. Such a goal is drastically different from teaching an LLM to solve a new task. When we are trying to teach an LLM new knowledge, more data is always better (hence the massive instruction tuning datasets used by models like FLAN). However, aligning the language model to a certain style or structure of output does not require learning new information! So, maybe alignment-focused goals require less extensive fine tuning.
Less is more for alignment: Research on the topic of LLM fine tuning was catalyzed by the release of LLaMA [2] (and later LLaMA-2 [3]), which made high-quality foundation LLMs openly available. Shortly after LLaMA, authors from Meta published LIMA [4], which showed that alignment-style fine tuning can be accomplished with very little data. In particular, the goal of alignment is to adapt the LLM’s style (rather than to learn new information), which can be accomplished via a small, high-quality, and diverse fine-tuning dataset. Such findings revealed that the majority of an LLM’s knowledge comes from pretraining, and the LLM learns the correct style during alignment (see quote below).
“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.” – from LIMA paper [4]
Imitating proprietary LLMs: Following LIMA, a massive number of high-quality, fine-tuned LLMs (e.g., Alpaca, Vicuna, Koala, Orca, and more) were created by fine tuning LLaMA over small synthetic fine-tuning datasets of GPT-3.5/4 outputs. In this way, we could train these models to imitate the output of more powerful LLMs. When evaluated in human trials and on simplistic benchmarks, these models seemed to match (or exceed) the performance of powerful models like ChatGPT. For this reason, practitioners began to believe that we could surpass models like GPT-4 or ChatGPT by performing a small amount of (inexpensive) fine tuning.
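For context, these imitation datasets were typically built by prompting the stronger model and saving its outputs as training targets. The sketch below shows that idea with the openai Python package (v1-style client); the model name, prompts, and output file are illustrative assumptions, and the real projects used much larger and more careful collection pipelines.

```python
# Hedged sketch: collecting synthetic fine-tuning data by imitating a stronger
# model's outputs. Assumes `pip install openai` (v1-style client) and an
# OPENAI_API_KEY in the environment; the model name is illustrative.
import json
from openai import OpenAI

client = OpenAI()
prompts = [
    "Explain the difference between supervised and unsupervised learning.",
    "Write a short thank-you note to a conference reviewer.",
]

records = []
for prompt in prompts:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whichever stronger model is imitated
        messages=[{"role": "user", "content": prompt}],
    )
    records.append({
        "instruction": prompt,
        "response": completion.choices[0].message.content,
    })

# Each record becomes one SFT example for the smaller model being fine tuned.
with open("imitation_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```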
What’s going on here? Obviously, training a model like ChatGPT cannot be done this easily. Researchers quickly discovered some limitations in the work done on imitation models [5]:
- Humans are easily tricked if the style of the LLM is good, and (as shown by LIMA) these models can quickly learn to mimic the style of models like ChatGPT with little data.
- The benchmarks that were used are too limited. The models perform well when evaluated by a small group of humans, but their performance falls apart on more extensive benchmarks that include traditional, perplexity-based evaluations (e.g., standard NLP benchmarks).
We can learn certain things (e.g., style and output format) from fine tuning over a small amount of data, but we cannot learn everything! These imitation models lack the knowledge base of more powerful LLMs, which can only be learned from large amounts of data.
Putting everything together: Given all the information we have covered so far, there are a few takeaways that we can deduce:
- Most of an LLM’s knowledge comes from pretraining.
- We can perform fine tuning in the form of continued pretraining to expose the LLM to more (and new) data/knowledge.
- Alignment-focused objectives can be achieved via fine tuning (SFT) on small, high-quality datasets. We don’t need tons of data to learn the style or format of output, only to learn new knowledge.
When performing fine tuning, it is very important that we know which goal (either alignment or knowledge injection) we are aiming for. Then, we should put benchmarks in place that allow us to accurately and comprehensively assess whether or not that goal was achieved. The imitation models failed to do this, which led to a bunch of misleading claims/results!
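As a concrete (if coarse) example of such a benchmark, we can track perplexity on a held-out corpus that represents the knowledge we care about, alongside any style-focused or human-preference evaluation. Below is a minimal perplexity computation with Hugging Face transformers; the checkpoint and held-out texts are placeholders.

```python
# Minimal sketch: perplexity of a causal LM on held-out text, one coarse check
# of knowledge retention to pair with style-focused evaluations.
# The checkpoint and the held-out texts are illustrative placeholders.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the fine-tuned checkpoint being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

held_out_texts = [
    "The mitochondrion is the site of oxidative phosphorylation.",
    "The Treaty of Westphalia was signed in 1648.",
]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in held_out_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        loss = model(input_ids=ids, labels=ids).loss  # mean NLL over predicted tokens
        n_predicted = ids.shape[1] - 1  # labels are shifted by one internally
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted

print("held-out perplexity:", math.exp(total_nll / total_tokens))
```

Frameworks like EleutherAI’s lm-evaluation-harness automate far broader versions of this; the point is simply to evaluate knowledge and style separately instead of relying on human preference alone.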
Ongoing work: The story doesn’t stop here! In fact, the distinction between pretraining and fine tuning is still quite vague. At what point does the LLM start actually learning new knowledge instead of just learning style/alignment? Many recent publications continue to study this question:
- Fine-tuning vs. RAG [6]: authors find that continued pretraining is not very effective at knowledge injection, whereas RAG is actually highly effective at specializing an LLM to a new knowledge base (a minimal retrieval sketch follows this list).
- LIMIT [7]: authors from MosaicML/Databricks show that we can perform fine tuning over a small mixture of instruction tuning and alignment-focused data, yielding a model that performs well on both NLP benchmarks and style-focused evaluations.
- TULU [8]: authors subject fine-tuned LLMs to broader evaluations, finding that the quality of the base model has a huge impact on performance and that no single fine-tuning dataset/strategy yields the best results across all benchmarks.
- TULU-2 [9]: authors show that fine tuning LLMs over specific datasets leads to the model learning specific skills and domains of data. Fine tuning works well if we make sure the fine-tuning dataset is highly relevant to the style/domain of evaluation we are using.
- AlpaGasus [10]: authors directly study how much fine-tuning data is necessary for an LLM to perform well on various downstream tasks.
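To make the RAG comparison in the first item above concrete, here is a dependency-light retrieval sketch: TF-IDF similarity picks the most relevant document, which is prepended to the prompt that would be sent to an unmodified LLM. The documents, question, and prompt format are made up, and production systems typically use learned embeddings and a vector index instead.

```python
# Minimal retrieval-augmented prompting sketch: retrieve the most relevant
# document with TF-IDF and prepend it to the prompt. Documents, question,
# and prompt format are illustrative; real systems use learned embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The company's 2023 revenue was $4.2B, up 12% year over year.",
    "Employees accrue 20 days of paid vacation per calendar year.",
    "The API rate limit is 100 requests per minute per key.",
]

question = "How many vacation days do employees get?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([question])

best = cosine_similarity(query_vector, doc_vectors).argmax()

prompt = (
    f"Answer the question using only the context below.\n\n"
    f"Context: {documents[best]}\n\nQuestion: {question}\nAnswer:"
)
print(prompt)  # this prompt would then be sent to the (unmodified) LLM
```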
Bibliography:
[1] Wei, Jason, et al. “Finetuned language models are zero-shot learners.” arXiv preprint arXiv:2109.01652 (2021).
[2] Touvron, Hugo, et al. “LLaMA: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).
[3] Touvron, Hugo, et al. “Llama 2: Open foundation and fine-tuned chat models.” arXiv preprint arXiv:2307.09288 (2023).
[4] Zhou, Chunting, et al. “LIMA: Less is more for alignment.” Advances in Neural Information Processing Systems 36 (2024).
[5] Gudibande, Arnav, et al. “The false promise of imitating proprietary LLMs.” arXiv preprint arXiv:2305.15717 (2023).
[6] Ovadia, Oded, et al. “Fine-tuning or retrieval? Comparing knowledge injection in LLMs.” arXiv preprint arXiv:2312.05934 (2023).
[7] Jha, Aditi, et al. “LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms.” arXiv preprint arXiv:2311.13133 (2023).
[8] Wang, Yizhong, et al. “How far can camels go? Exploring the state of instruction tuning on open resources.” Advances in Neural Information Processing Systems 36 (2024).
[9] Ivison, Hamish, et al. “Camels in a changing climate: Enhancing LM adaptation with Tulu 2.” arXiv preprint arXiv:2311.10702 (2023).
[10] Chen, Lichang, et al. “AlpaGasus: Training a better alpaca with fewer data.” arXiv preprint arXiv:2307.08701 (2023).