Some Core Principles of Large Language Model (LLM) Tuning

Subrata Goswami
Dec 31, 2023


Large Language Models (LLMs) such as ChatGPT, Llama2, etc. have taken the world by storm this year. What seems like a recent phenomenon is actually the culmination of painstaking research and development over a number of years. LLMs are a very rapidly developing field that has become broad within a very short time. GPT-x from OpenAI (and Llama2 from Meta to some extent) is the trend-setting model family, with a significant body of publications that has shaped the space so far. This piece explores some of the important concepts (with only brief code sketches and without a lot of math) that are at the heart of such models.

This piece liberally uses text and images (original or modified) from some of the listed references.

Pre-Training

One of the significant breakthroughs in the language model space has been the ability to do zero/few-shot learning without retraining the parameters. This gives LLMs the ability to adapt easily to context-based question answering — the raw power behind ChatGPT-like software. GPT2 built on this idea, demonstrated earlier by MQAN (multitask question answering network), of using natural language to flexibly specify tasks, inputs, and outputs as a sequence of symbols. Prior multitasking approaches such as MAML were complex and not scalable; GPT2 instead learns such tasks in a purely unsupervised way.

From the GPT2 paper — “For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). …. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.“

Hence, in modern LLMs, pre-training on billions to trillions of tokens is unsupervised.

Another significant breakthrough benefiting LLMs is the Transformer architecture. Transformers removed the sequential bottlenecks that prevented previous language models such as RNNs/LSTMs from scaling.

Other developments that helped are increases in GPU compute capability (silicon technology and design) and improvements in software and frameworks. Transformers are very compute and memory intensive, and without the silicon and algorithm improvements, it would not have been practical to train transformer models.

Another very significant catalyst is the availability of data for autoregressive unsupervised pre-training, which would not have been possible without the readily available content on the Internet.

Fine-tuning

The pre-trained LLM’s are further trained ( or tuned ) to adapt to different uses cases (e.g chatbot, summerization, content generation, etc.), domains (e.g. finance, health, ) , content (e.g. non-harmful, helpful), etc.

The commonly used (and published) fine-tuning methods are as follows.

  • Supervised fine tuning (SFT) — instruction fine tuning
  • Reinforcement learning with human feedback (RLHF)

SFT minimizes, token by token, the loss between what the model outputs and the correct result; a minimal sketch is shown below. Both the InstructGPT and Llama2 papers found that SFT alone does not sufficiently improve model output.
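
As a concrete illustration, here is a minimal sketch of that token-by-token objective, assuming a causal LM that maps token ids to next-token logits (PyTorch-style; names are illustrative, not any paper's code):

```python
# Minimal sketch of the SFT objective: next-token cross-entropy, token by token.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq_len, vocab]; tokens: [batch, seq_len] ground-truth ids."""
    # Predict token t+1 from positions up to t: shift logits left, targets right.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```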

The GPT3 paper did not use fine-tuning; instead, all results were obtained with in-context information (zero-, one-, or few-shot). From the GPT3 paper —

“The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.”

The GPT3 paper shows results on about 65 evaluations spanning language modelling, Q&A, translation, common-sense reasoning, reading comprehension, etc. On many of these tests (listed in Table H.1 of the paper), the GPT3 model bettered or equaled fine-tuned SOTA models.

GPT4 uses fine-tuning, both SFT and RLHF. The paper claims that RLHF fine-tuning makes the model significantly safer.

Llama2 also uses SFT and RLHF. From the Llama2 paper —

“These closed product LLMs are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety. This step can require significant costs in compute and human annotation, and is often not transparent or easily reproducible, limiting progress within the community to advance AI alignment research.”

Reinforcement Learning with Human Feedback (RLHF):

In RLHF, the policy is a supervised fine-tuned version of the LLM. The reward model (RM) is also the same LLM, but with a scalar reward head in place of the next-token generation head. The RM is trained on user preference data collected over a period of time and acts as a proxy for the environment. The LLM is then fine-tuned against the RM with an optimization algorithm such as Proximal Policy Optimization (PPO). The collection of user preference data and the model updates form a loop, with newer preference data driving further fine-tuning. The following pictures from the InstructGPT and Llama2 papers illustrate the process.

InstructGPT RLHF process
Llama-2-Chat RLHF process

The Reward Model approach was detailed in a 2018 paper by DeepMind (see the reference "Scalable agent alignment via reward modeling: a research direction"). The idea had been used in earlier publications such as the 2017 "Deep Reinforcement Learning from Human Preferences" by OpenAI and DeepMind.

Reward Model from the “Scalable agent … “ paper.

“the ultimate goal of machine learning (ML) research is to go beyond games and improve human lives. To achieve this we need ML to assist us in real-world domains, ranging from simple tasks like ordering food or answering emails to complex tasks like software engineering or running a business. Yet performance on these and other real-world tasks is not easily measurable, since they do not come readily equipped with a reward function. Instead, the objective of the task is only indirectly available through the intentions of the human user”

“On the one hand, we want ML to generate creative and brilliant …. On the other hand, we want to avoid degenerate solutions that lead to undesired behavior like exploiting a bug … In order to differentiate between these two outcomes, our agent needs to understand its user’s intentions, and robustly achieve these intentions with its behavior. We frame this as the agent alignment problem: How can we create agents that behave in accordance with the user’s intentions? “

“We break the problem into two parts: (1) learning a reward function from the feedback of the user that captures their intentions and (2) training a policy with reinforcement learning to optimize the learned reward function. In other words, we separate learning what to achieve (the ‘What?’) from learning how to achieve it (the ‘How?’). We call this approach reward modeling. Figure 1 illustrates this setup schematically.”

“Learning a reward function separately from the agent’s policy allows us to disentangle the agent’s objective from its behavior. If we understand the reward function, we know what the agent is optimizing for; in particular, we know whether its intentions are aligned with the user’s intentions. This has three advantages that could help make reward modeling economical:

  • 1. The user does not have to provide feedback on every interaction between agent and environment, as would be the case if we trained a policy from user feedback directly. Since deep RL algorithms tend to be very sample-inefficient (e.g. taking weeks of real-time to learn to play an Atari game), providing feedback on every interaction is usually not practical.
  • 2. We can distinguish between alignment of the policy and alignment of the reward model (Ibarz et al., 2018).
  • 3. We can leverage progress on deep RL agents by plugging a more capable agent into our reward modeling setup.
  • 4. The user does not need to solve the credit assignment problem”

The Reward Model approach for text summarization was used in the paper "Better Rewards Yield Better Summaries: Learning to Summarise Without Reference". There the authors formulated a cross-entropy loss over pairwise comparisons of outputs, which greatly expands the granularity of the preference signal. This approach is used by many LLMs currently; a sketch of such a pairwise loss follows the figure below.

Cross Entropy Loss from the paper "Better Rewards Yield …"
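
A minimal sketch of such a pairwise cross-entropy (preference) loss, assuming the reward model outputs one scalar per (input, output) pair; function and tensor names are illustrative:

```python
# Sketch of a pairwise preference loss for a reward model: the RM should
# score the human-preferred output higher than the rejected one.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Both inputs: [batch] scalar rewards. Loss = -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```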

Both of the above techniques were then combined by the authors of the paper "Learning to summarize from human feedback" (a precursor to InstructGPT, or proto-InstructGPT). The paper uses a filtered version of the Reddit TL;DR dataset. From this paper —

“We follow the works of [3, 73], who fine-tune language models from human feedback using reward learning [35]. We first collect a dataset of human preferences between pairs of summaries, then train a reward model (RM) via supervised learning to predict the human-preferred summary. Finally, we train a policy via reinforcement learning (RL) to maximize the score given by the RM; the policy generates a token of text at each ‘time step’, and is updated using the PPO algorithm [58] based on the RM ‘reward’ given to the entire generated summary. We can then gather more human data using samples from the resulting policy, and repeat the process. We follow the works of [48, 4] and use large pretrained GPT-3 models with as many as 6.7 billion parameters.”

The following diagram shows the RLHF process used in the above paper.

Proto-InstructGPT fine-tuning (from the paper "Learning to summarize…")

The steps in the process are as follows.

“Step 1: Collect samples from existing policies and send comparisons to humans. For each Reddit post, we sample summaries from several sources including the current policy, initial policy, original reference summaries and various baselines. We send a batch of pairs of summaries to our human evaluators, who are tasked with selecting the best summary of a given Reddit post.

Step 2: Learn a reward model from human comparisons. Given a post and a candidate summary, we train a reward model to predict the log odds that this summary is the better one, as judged by our labelers.

Step 3: Optimize a policy against the reward model. We treat the logit output of the reward model as a reward that we optimize using reinforcement learning, specifically with the PPO algorithm [58].”

“Crucially, we also filter to include only posts where the human-written summaries contain between 24 and 48 tokens, to minimize the potential effect of summary length on quality (see Section 4.1 and Appendix F). Our final filtered dataset contains 123,169 posts, and we hold out ~5% as a validation set”

“(4) We publicly release our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores).”

“All of our models are Transformer decoders [62] in the style of GPT-3 [47, 4]. We conduct our human feedback experiments on models with 1.3 billion (1.3B) and 6.7 billion (6.7B) parameters.”

“Pretrained models. Similarly to [12, 47], we start with models pretrained to autoregressively predict the next token in a large text corpus.”

“Supervised baselines. We next fine-tune these models via supervised learning to predict summaries from our filtered TL;DR dataset (see Appendix B for details). We use these supervised models to sample initial summaries for collecting comparisons, to initialize our policy and reward models, and as baselines for evaluation.”

“Reward models. To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary y ∈ {y0, y1} is better as judged by a human, given a post x.”

“Human feedback policies. We want to use the reward model trained above to train a policy that generates higher-quality outputs as judged by humans. We primarily do this using reinforcement learning, by treating the output of the reward model as a reward for the entire summary that we maximize with the PPO algorithm [58], where each time step is a BPE token⁸. We initialize our policy to be the model fine-tuned on Reddit TL;DR. Importantly, we include a term in the reward that penalizes the KL divergence between the learned RL policy πRLφ with parameters φ and this original supervised model πSFT, as previously done in [25].”

“Note that the reward model only gives rewards for entire summaries, and not at intermediate time steps. In RL terminology, each episode terminates when the policy outputs the EOS token, and the discount factor γ = 1.”

Because the reward model only provides a judgement on the entire summary rather than on each generated token, token-level supervised training does not directly apply; RL is better suited to such delayed, sequence-level feedback (i.e. loss).

“For the PPO value function, we use a Transformer with completely separate parameters from the policy. This prevents updates to the value function from partially destroying the pretrained policy early in training (see ablation in Appendix G.1). We initialize the value function to the parameters of the reward model. In our experiments, the reward model, policy, and value function are the same size.”

“For PPO, we run with separate policy and value networks, initializing our policies to the supervised baseline, and our value functions to the reward model.”

The InstructGPT paper, “Training language models to follow instructions with human feedback”, improved upon the previous work. The steps are shown in the following picture.

InstructGPT fine tuning ( from the InstructGPT paper)

“Starting from the SFT model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix C for more details).”

“Once again following Stiennon et al. (2020), we fine-tuned the SFT model on our environment using PPO (Schulman et al., 2017). The environment is a bandit environment which presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over optimization of the reward model. The value function is initialized from the RM. We call these models “PPO.” “
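
A minimal sketch of how such a KL-shaped, per-token reward might be assembled (tensor names and the β coefficient are illustrative assumptions, not OpenAI's code):

```python
# Sketch of the KL-penalized reward used during PPO fine-tuning.
# The RM reward is given once for the whole response; a per-token KL penalty
# against the frozen SFT model discourages drifting too far from it.
import torch

def shaped_rewards(rm_reward: torch.Tensor,   # [batch] scalar from the reward model
                   logp_rl: torch.Tensor,     # [batch, T] log-probs of sampled tokens under the RL policy
                   logp_sft: torch.Tensor,    # [batch, T] log-probs of the same tokens under the SFT model
                   beta: float = 0.02) -> torch.Tensor:
    per_token = -beta * (logp_rl - logp_sft)  # per-token KL penalty estimate
    per_token[:, -1] += rm_reward             # RM score added at the final (EOS) step
    return per_token                          # [batch, T] rewards fed to PPO
```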

“In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. “

“The final reward model was initialized from a 6B GPT-3 model that was fine-tuned on a variety of public NLP datasets (ARC, BoolQ, CoQA, DROP, MultiNLI, OpenBookQA, QuAC, RACE, and Winogrande). This was mostly for historical reasons; we find similar results when initializing the RM from the GPT-3 or SFT models.”

“Over the course of the project, we trained several reward models and policies. Each batch of summaries that we sent to the labelers were sampled from a variety of policies. We didn’t have a systematic plan for which policies to sample from; rather, we chose what seemed best at the time in the spirit of exploratory research. Every time we trained a reward model, we trained on all labels we had collected so far. Successive models also benefited from improved hyperparameters and dataset cleaning.”

Transformer Networks

The transformer architecture that has revolutionized Generative AI has two major parts — encoder and decoder. Some very well-known subsequent architectures make use of only one part.

For text generation, the decoder part of the transformer architecture is used. It is trained in an auto-regressive manner to predict the next word — which turns out to be very suitable for what is needed for generation. Tokens to the right of the current position are masked during pre-training.

The encoder part of the transformer learns to encode complete sentences. Random words in the sentence are masked during pre-training. These models are suited for language-understanding tasks such as classification and sentiment analysis.

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model family. T5 is an encoder-decoder model family. GPTs and Llamas are decoder-only models. The masking difference between the two regimes is sketched below.
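
To make the masking difference concrete, here is a small illustrative sketch (not any specific model's code): a causal mask for decoder-style pre-training versus random token masking for encoder-style pre-training.

```python
import torch

T = 5
# Decoder-only (GPT-style): position t may attend only to positions <= t.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Encoder-only (BERT-style): full bidirectional attention; instead, a random
# subset of input tokens is replaced by a [MASK] token and must be predicted.
mask_prob = 0.15
masked_positions = torch.rand(T) < mask_prob

print(causal_mask)
print(masked_positions)
```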

GPT:

As mentioned previously, GPTs (Generative Pre-trained Transformer) are decoder-only models. OpenAI's paper described four GPT2 models ranging from 117M to 1.5B parameters; however, initially only the three smaller versions were released. The code for GPT-3/3.5/4 has not been open sourced either.

GPT-3 comes in 8 sizes, ranging from 125M to 175B parameters. The largest GPT-3 model has 175B parameters and is called davinci. The smallest GPT-3 model has 125M parameters and is roughly the size of BERT-Base and RoBERTa-Base.

According to the OpenAI’s GPT3 paper, the following are the major differences with GPT2 architecture.

“same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3.”

InstructGPT is an RLHF-tuned GPT3. It is trained to follow an instruction in a prompt and provide a detailed response that is safer, more helpful, and more aligned with what users want. GPT3 alone was not able to do that because it was only trained to predict the next word on a large dataset of Internet text. From the InstructGPT paper —

“Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model, despite having more than 100x fewer parameters.” The InstructGPT paper goes over SFT, RM and PPO in detail.

ChatGPT is a GPT3.5 model fine-tuned with RLHF. The GPT3.5 model is first fine-tuned with data generated by human AI trainers playing both the user and the AI assistant, mixed with the InstructGPT dataset. The reward model is trained on responses ranked by quality by human AI trainers. Multiple responses were generated as follows: “We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them”. No further detail was provided on “sampled several alternative completions”! Finally, with the reward model, Proximal Policy Optimization (PPO) was used to fine-tune the model.

The GPT4 paper does not provide any detail on the architecture or dataset beyond the following.

“GPT-4 is a Transformer-style model [39] pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) [40]. Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

There are a number of open-source GPT-type alternatives, such as GPT-J and GPT-Neo, released by non-OpenAI entities.

GPT-J-6B is an open-source language model developed by EleutherAI.

GPT-Neo is a family of several models, from 125M to 2.7B parameters. It is an implementation of model- and data-parallel GPT-like models built with the Mesh-TensorFlow library. They are trained on the Pile dataset. The architecture is similar to GPT2, except that GPT-Neo uses local attention in every other layer with a window size of 256 tokens.

GPT-NeoX is a 20 billion parameter autoregressive language model trained on the Pile dataset from EleutherAI. It is based on NVIDIA’s Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some other novel optimizations.

Nvidia provides GPT-3 code based on Megatron-LM for pre-training, which they used for MLPerf pre-training benchmarks. Some differences of this model from the GPT-3 paper are: the tokenizer is changed from BPE to SentencePiece with BPE, and alternating sparse attention layers are not used. It uses the HuggingFace C4 dataset. Megatron is a large, powerful transformer developed by NVIDIA.

LLAMA2:

Llama2 is also a decoder-only model. There are four variants at this point — 7B, 13B, 34B and 70B. There are instruction-tuned and RLHF versions of the base models, called Llama2-Chat.

Fine-tuning is similar to GPTx — supervised fine-tuning followed by RLHF fine-tuning. Human annotators first write a prompt, then choose between two sampled model responses based on provided criteria. In order to maximize diversity, the two responses to a given prompt are sampled from two different model variants ( 7B and 13B ? ) and with varying temperature hyper-parameters. Temperature sampling divides the logits by a temperature parameter (typically between 0 and 1) before the softmax, so the temperature appears as a denominator in the exponents; a small sketch is shown below.
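
A minimal sketch of temperature sampling (illustrative, not Meta's code):

```python
# Divide the logits by the temperature before the softmax, then sample.
# Lower temperatures sharpen the distribution; higher temperatures flatten it.
import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
    """logits: [vocab]. Returns the id of one sampled token."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```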

The Llama2 fine-tuned models (Llama2-Chat) used both public instruction-tuning data and high-quality in-house data from Meta for supervised fine-tuning (SFT). The in-house dataset contains a total of 27,540 annotations. For SFT, each sample consists of a prompt and an answer. To ensure the model sequence length is properly filled, the prompts and answers from the training set are concatenated, with a special token separating prompt from answer. SFT is done with an auto-regressive objective where the loss on tokens from the user prompt is zeroed out, so back-propagation is done only on answer tokens; a sketch of this setup follows.
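
A minimal sketch of that setup, using PyTorch's default ignore_index so prompt positions contribute no loss (token ids and the separator token are illustrative assumptions):

```python
# Concatenate prompt and answer, and mask prompt labels so the auto-regressive
# loss is computed only on answer tokens.
import torch
import torch.nn.functional as F

IGNORE = -100  # cross_entropy's default ignore_index: positions with this label get no loss

def build_inputs(prompt_ids: list[int], answer_ids: list[int], sep_id: int):
    input_ids = prompt_ids + [sep_id] + answer_ids
    labels = [IGNORE] * (len(prompt_ids) + 1) + answer_ids   # zero out prompt loss
    return torch.tensor(input_ids), torch.tensor(labels)

def masked_sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: [seq, vocab]; labels: [seq]; shifted for next-token prediction."""
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE)
```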

RLHF was applied to the SFT'ed model to further align model behavior with human preferences and instruction following. A binary comparison protocol was chosen for annotation: human annotators write a prompt, outputs from two different model variants (with different temperatures) are generated for that prompt, and the annotators select the output they prefer and grade their choice as significantly better, better, slightly better, or negligibly better/unsure. The reward model learns the preferences of the human annotators and can then automate preference decisions.

The Reward Model architecture and hyper-parameters are identical to those of the pre-trained language models, except that the classification head for next-token prediction is replaced with a regression head to output a scalar reward.

RLHF fine-tuning was done with two main algorithms — Proximal Policy Optimization (PPO) and Rejection Sampling fine-tuning. Rejection Sampling here means sampling several outputs for the same prompt (from slightly different models) and keeping only one of them, as sketched below.
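
A minimal sketch of the best-of-K selection at the heart of rejection-sampling fine-tuning; the `generate` and `reward_model` callables are illustrative stand-ins, not Meta's code:

```python
# Sample K candidate responses per prompt, keep the one the reward model scores
# highest, and use the kept (prompt, response) pairs for another round of SFT.
def best_of_k(prompt, generate, reward_model, k: int = 8):
    candidates = [generate(prompt) for _ in range(k)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# sft_dataset = [(p, best_of_k(p, generate, reward_model)) for p in prompts]
```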

Datasets for Pre-training and Fine-tuning:

Pre-training datasets:

The pre-training datasets used for GPTx and LlamaX are not open sourced.

GPT3 used a filtered and deduped version of the roughly 1-trillion-word Common Crawl dataset, augmented with an expanded version of the WebText dataset, two internet-based books corpora (Books1 and Books2), and English-language Wikipedia.

Llama2 models were pre-trained on 2 trillion tokens of a new mix of data from publicly available sources. The dataset does not include data from Meta's products or services, and avoids sites with personal information.

The HuggingFace C4 dataset is used by Nvidia's Megatron-LM.

The Pile dataset from EleutherAI is used by both GPT-Neo and GPT-NeoX.

Fine-tuning datasets:

The “Learning to summarize from human feedback” paper (the proto-InstructGPT precursor) used a filtered version of the Reddit TL;DR dataset for SFT, because the “summarization task is significantly more challenging than on CNN/DM”. This dataset contains about 3 million posts from reddit.com across a variety of topics (subreddits), as well as summaries of the posts written by the original poster (the TL;DRs). The filtered dataset contains 123,169 posts, of which about 5% was set aside for validation.

The same dataset is also used for RLHF RM and policy training along with the additional human preference feedback. The authors provide inference code for 1.3B models and baselines, as well as a model card and the human feedback dataset with over 64k summary comparisons.

InstructGPT’s “prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of our demonstration data) on the Playground interface”

“To train the very first InstructGPT models, we asked labelers to write prompts themselves. This is because we needed an initial source of instruction-like prompts to bootstrap the process, and these kinds of prompts weren’t often submitted to the regular GPT-3 models on the API.”

“ We asked labelers to write three kinds of prompts:

  • Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity.
  • Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction.
  • User-based: We had a number of use-cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.

From these prompts, we produce three different datasets used in our fine-tuning procedure: (1) our SFT dataset, with labeler demonstrations used to train our SFT models, (2) our RM dataset, with labeler rankings of model outputs used to train our RMs, and (3) our PPO dataset, without any human labels, which are used as inputs for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API).”

Evaluation Datasets:

The HuggingFace Open LLM Leaderboard uses ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K and DROP.

InstructGPT authors performed automatic evaluations on the following benchmark datasets:

“Winogender (Rudinger et al., 2018), CrowS-Pairs (Nangia et al., 2020), RealToxicityPrompts (Gehman et al., 2020), TruthfulQA (Lin et al., 2021), DROP (Dua et al., 2019), QuAC (Choi et al., 2018), SquadV2 (Rajpurkar et al., 2018), Hellaswag (Zellers et al., 2019), SST (Socher et al., 2013), RTE and WSC (both part of SuperGLUE (Wang et al., 2019)), WMT 15 Fr → En (Bojar et al., 2015), CNN/Daily Mail Summarization (Nallapati et al., 2016), and Reddit TLDR Summarization (Völske et al., 2017).”

Llama2 evaluated and compared the pre-trained models with a number of open-source (MPT, Falcon) and closed-source (GPT 3.5/4, PaLM, PaLM-2-L) models. For the open-source comparison, they used both non-aggregated and aggregated benchmarks as follows.

  • Non-aggregated — HumanEval and MBPP for Code; PIQA, SIQA, HellaSwag, WinoGrande, ARC easy and challenge, OpenBookQA and CommonsenseQA for Commonsense Reasoning; NaturalQuestions and TriviaQA for World Knowledge; SQuAD, QuAC and BoolQ for Reading Comprehension; GSM8K and MATH for Mathematics.
  • Aggregated — MATH, MMLU, BBH, AGI Eval

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.

MATH and GSM-8K are commonly used benchmarks for mathematical reasoning.

HellaSwag is a commonsense inference dataset.

TruthfulQA is a benchmark made up of questions designed to cause imitative falsehoods (e.g. “What is 1241 × 123?”, GPT-3 outputs “14812” ).

Winogrande is a large-scale commonsense reasoning dataset of 44k problems, inspired by the original WSC design but adjusted to improve both the scale and the hardness of the dataset.

Big Bench Hard (BBH) — BIG-Bench (Beyond the Imitation Game benchmark) is a set of 204 or more language tasks. The benchmark tasks are novel and cover a diverse range of topics and languages. BBH is a subset of 23 challenging BIG-Bench tasks.

AGI Eval is a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests.

Appendix:

Reinforcement Learning — in supervised learning, with an easily implementable cost function and gradient-descent optimization, it is relatively easy to get excellent results with little hyperparameter tuning. Unlike supervised learning, reinforcement learning (RL) does not require labelled input/output pairs, and is better suited for goals, such as somewhat subjective human safety and alignment, that are not captured well by simple metrics or loss functions.

There are two significant components in RL — the agent (+ policy) and the environment (+ reward). The agent, through its policy, explores the environment and accumulates rewards offered by the environment. The rewards are usually sporadic, not immediate (e.g. arriving only after multiple steps of the agent's exploration), and not easily expressed quantitatively as a loss.

RL algorithms can be model-based or model-free. Having a model of the environment and agent a priori is not possible in most situations; hence, the model-free approach is often used. Q-learning and SARSA are the most well known in the model-free category.

A policy, π(a|s), is a rule or function used by an agent to decide what action (a) to take in a state (s). It can be deterministic, in which case it is usually denoted by μ, or stochastic, in which case it is usually denoted by π. The parameters of a policy are denoted by θ.

Given a policy π and a discount factor γ ∈ [0, 1], the state value function is the expected sum of discounted rewards (r) over a number of steps (also called the horizon).

Vπ (s(t) = s) = Eπ[r(t) + γr(t+1) + γ²r(t+2) + γ³r(t+3) + · · · |s(t) = s] # follow π
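
The value function is just the expectation of this discounted return over trajectories; here is a tiny worked example of computing the return for one sampled reward sequence (plain Python, illustrative):

```python
# Discounted return for one trajectory of per-step rewards, accumulated from the end.
def discounted_return(rewards: list[float], gamma: float = 0.99) -> float:
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645
```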

Q, the state-action value function of a policy, is defined as follows.

Qπ (s, a) = R(s, a) + γ Σₛₚ P(sp|s, a)Vπ (sp) # Take action a, then follow π

The main goal of reinforcement learning is to find the optimal policy πᵒᵖᵗ that maximizes the expected cumulative reward:

πᵒᵖᵗ = arg maxπ Vπ(s) # maximize over policies π

The value function can be represented by a lookup table where each state has a corresponding entry, Vπ(s), or each state-action pair has an entry, Qπ(s, a). This is the tabular form used by the model-free approach called Q-learning; a small sketch follows. However, the lookup-table approach does not generalize well to very large state and/or action spaces, and in other cases it may be preferable to quickly learn an approximate function rather than wait for the value of each state to converge. A popular way to represent this approximate function currently is a deep neural network.
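
A minimal sketch of the tabular Q-learning update and an ε-greedy policy; the state/action encodings are illustrative:

```python
# Tabular Q-learning: Q[(state, action)] entries updated toward the Bellman target.
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Bellman-style target: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Explore with probability eps, otherwise exploit the current Q estimates.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```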

The SARSA Algorithm ( from CS234 lecture4 slide)
The Q-Learning Algorithm ( from CS234 lecture4 slide )

Q-Learning learns a Q-function that satisfies the Bellman Equation, a recursive relation on the value function.

Bellman equation (from CS234 lecture2 slide)

Learning the Q-function is accomplished by minimizing the Mean Squared Bellman Error (MSBE) loss function. The Q-function is then used to obtain a policy (e.g. ε-greedy ) as shown in the previous picture.

In Deep Q-Learning, Q is represented by a deep neural network, Q(s, a, w), with parameters w. The update to Q is through SGD on some measure (e.g. MSE) of the difference between the target value and the function approximation.

Policy Gradient methods directly try to maximize the expected return by taking small steps in the direction of the policy gradient. Policy gradient algorithms search for a local maximum in value, V(θ) = Vπ(θ), by ascending the gradient with respect to the policy parameters θ: ∆θ = α∇θV(θ), where ∇θV(θ) is the policy gradient. After some approximation and algebraic work, this is captured in the Policy Gradient Theorem as follows. The point to note is that it is independent of any model of the agent or environment.

The Policy Gradient Theorem ( from CS234 lecture8 slides )

The policy gradient estimator, ∇θV(θ), can be obtained by differentiating the policy objective, J(θ) (or LPG(θ), the policy gradient loss); a minimal REINFORCE-style sketch follows.
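
The simplest instance of this estimator is REINFORCE. Below is a minimal sketch, assuming `policy` is any module mapping a batch of states to action logits and `returns` holds discounted returns-to-go (names are illustrative):

```python
# REINFORCE: estimate the policy gradient as grad log pi(a|s) weighted by the return.
import torch

def reinforce_loss(policy, states, actions, returns):
    """states: [T, obs_dim]; actions: [T] int64 action indices; returns: [T]."""
    logits = policy(states)                                    # [T, n_actions]
    logp = torch.log_softmax(logits, dim=-1)
    logp_taken = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Minimizing this loss ascends the policy-gradient direction.
    return -(logp_taken * returns).mean()
```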

Unlike general RL, in a Multi-Armed Bandit environment at each time step, the agent takes an action ( based on its policy) where the next state does not depend on the action chosen by the agent.

PPO is a simpler algorithm with all the stability and reliability benefits of the trust region policy optimization (TRPO) algorithm. In TRPO, a surrogate objective function is maximized subject to a KL-divergence constraint on the size of the policy update. PPO replaces the KL-divergence constraint with a clipping hyper-parameter to limit the updates.

Whereas standard policy gradient methods perform one gradient update per data sample, PPO objective function enables multiple epochs of mini-batch updates. PPO alternates between sampling data from the policy and performing several epochs of optimization on the sampled data.

With a neural network architecture that shares parameters between the policy and the value function, a loss function that combines the policy surrogate and a value-function error term is used (TRPO does not share parameters); a sketch of this combined loss follows.
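
A minimal sketch of PPO's clipped surrogate objective combined with a value-function error term, in the spirit of the combined loss just described (names and coefficients are illustrative, not the paper's exact code):

```python
# PPO clipped surrogate + value loss for one minibatch of collected samples.
import torch

def ppo_loss(logp_new, logp_old, advantages, values, returns,
             clip_eps: float = 0.2, vf_coef: float = 0.5):
    ratio = torch.exp(logp_new - logp_old)                  # pi_new / pi_old per sample
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()     # clipped surrogate
    value_loss = (values - returns).pow(2).mean()           # value-function error term
    return policy_loss + vf_coef * value_loss
```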

The following shows the PPO algorithm.

The PPO Algorithm ( from paper Proximal Policy Optimization )

In the shared policy-and-value approach, the two networks are the same (see "The 37 Implementation Details of Proximal Policy Optimization" and the "value_latent" reference).

References:

  1. HuggingFace Transformers Notebooks , https://huggingface.co/docs/transformers/notebooks
  2. https://platform.openai.com/docs/models
  3. Improving Language Understanding by Generative Pre-Training, https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  4. GPT2 — https://github.com/openai/gpt-2 ( implemented in TensorFlow)
  5. HF GPT2’s gpt2 (124M/12L/768W), gpt2-medium(355M), gpt2-large (774M), gpt-XL(1.5B/48L/1600W), — https://huggingface.co/gpt2 , https://github.com/huggingface/transformers/tree/main/src/transformers/models/gpt2 ( has both TF and PT codes) .
  6. Language Models are Unsupervised Multitask Learners, https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
  7. GPT3 — https://en.wikipedia.org/wiki/GPT-3
  8. GPT3 — Language Models are Few-Shot Learners, https://arxiv.org/abs/2005.14165
  9. Deep Reinforcement Learning from Human Preferences, https://arxiv.org/pdf/1706.03741.pdf
  10. Scalable agent alignment via reward modeling: a research direction, https://arxiv.org/pdf/1811.07871.pdf
  11. Better Rewards Yield Better Summaries: Learning to Summarise Without Reference, https://arxiv.org/pdf/1909.01214.pdf
  12. Fine-Tuning Language Models from Human Preferences, https://arxiv.org/pdf/1909.08593.pdf
  13. Learning to summarize from human feedback, https://arxiv.org/pdf/2009.01325.pdf , https://github.com/openai/summarize-from-feedback , https://openaipublic.blob.core.windows.net/summarize-from-feedback/website/index.html#/ ,
  14. InstructGPT, https://openai.com/research/instruction-following,
  15. InstructGPT, Training language models to follow instructions with human feedback, https://arxiv.org/pdf/2203.02155.pdf
  16. Introducing ChatGPT, https://openai.com/blog/chatgpt
  17. GPT-4 Technical Report, https://arxiv.org/pdf/2303.08774.pdf
  18. GPT-Neo https://github.com/EleutherAI/gpt-neo
  19. GPT-NeoX https://github.com/EleutherAI/gpt-neox/
  20. https://github.com/Lightning-AI/lit-gpt
  21. C4 dataset . https://huggingface.co/datasets/allenai/c4
  22. Scaling Language Model Training to a Trillion Parameters Using Megatron, https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/
  23. Llama2 — Llama 2: Open Foundation and Fine-Tuned Chat Models, https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ , https://github.com/facebookresearch/llama-recipes/blob/main/docs/Dataset.md , https://github.com/facebookresearch/llama-recipes/tree/main/examples , https://github.com/facebookresearch/llama-recipes/tree/main
  24. Llama 2 is here — get it on Hugging Face, https://huggingface.co/blog/llama2, https://github.com/huggingface/blog/blob/main/llama2.md
  25. Understanding Encoder And Decoder LLMs https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder
  26. Gradient Checkpointing — Training Deep Nets with Sublinear Memory Cost, https://arxiv.org/pdf/1604.06174v2.pdf
  27. https://huggingface.co/datasets/timdettmers/openassistant-guanaco — subset of oasst highest-rated paths in the conversation tree, with a total of 9,846 samples.
  28. Dolly (Databricks) -15,000 instruction-context-response triples, https://github.com/databrickslabs/dolly , https://huggingface.co/datasets/databricks/databricks-dolly-15k
  29. MMLU (Massive Multitask Language Understanding) — https://arxiv.org/abs/2009.03300v3, https://github.com/hendrycks/test, https://huggingface.co/datasets/cais/mmlu
  30. HF LLM Leaderboard — https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
  31. LLM Inference Performance Engineering: Best Practices, https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
  32. Prometheus: Inducing Fine-grained Evaluation Capability in Language Models, https://arxiv.org/abs/2310.08491, https://huggingface.co/datasets/kaist-ai/Feedback-Collection, https://huggingface.co/kaist-ai/prometheus-13b-v1.0
  33. Rejection Sampling https://web.mit.edu/urban_or_book/www/book/chapter7/7.1.3.html, https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/lectures/lecture17.pdf
  34. Reducing Activation Recomputation in Large Transformer Models, https://arxiv.org/abs/2205.05198
  35. MQAN , The Natural Language Decathlon: Multitask Learning as Question Answering, https://arxiv.org/pdf/1806.08730.pdf
  36. Introduction to RL , https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
  37. Learning from human preferences, https://openai.com/research/learning-from-human-preferences
  38. RLHF: Reinforcement Learning from Human Feedback, https://towardsdatascience.com/rlhf-reinforcement-learning-from-human-feedback-faa5ff4761d1
  39. RLHF, https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback
  40. Proximal Policy Optimization (PPO), https://openai.com/research/openai-baselines-ppo
  41. Proximal Policy Optimization Algorithms, https://arxiv.org/pdf/1707.06347.pdf
  42. The 37 Implementation Details of Proximal Policy Optimization, https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
  43. value_latent , https://github.com/openai/baselines/blob/master/README.md?plain=1#L92
  44. Temperature sampling, top-k sampling — The Curious Case of Neural Text Degeneration, https://arxiv.org/pdf/1904.09751.pdf
  45. CSC 411 Lecture 21–22: Reinforcement learning, https://www.cs.toronto.edu/~jlucas/teaching/csc411/lectures/lec21_22_handout.pdf
  46. SARSA — https://en.wikipedia.org/wiki/State-action-reward-state-action
  47. Q-Learning — https://en.wikipedia.org/wiki/Q-learning
  48. CS234: Reinforcement Learning Winter 2019, https://web.stanford.edu/class/cs234/CS234Win2019/schedule.html
  49. David Silver’s UCL Course on RL , https://www.davidsilver.uk/teaching/ , https://www.davidsilver.uk/wp-content/uploads/2020/03/pg.pdf
  50. HuggingFace Deep Reinforcement Learning Course https://huggingface.co/learn/deep-rl-course/unit0/introduction
  51. Introduction to Multi-Armed Bandits, https://www.tensorflow.org/agents/tutorials/intro_bandit
