
Reinforcement Learning from Human Feedback (RLHF)

  • Purpose: Align the model with human values, making it:
    • More helpful.
    • Less harmful: avoid using toxic language in completions, replying in combative or aggressive voices, or providing detailed information about dangerous topics.
    • Less prone to misinformation: avoid hallucinating or confidently answering questions it doesn't know.

Reinforcement Learning

Tic-Tac-Toe is the classic example: an agent learns a policy for choosing actions (moves) in an environment (the board) to maximize a reward (winning). The same components map onto fine-tuning an LLM:

  • Objective: Maximize the reward received for actions.
  • Policy: The LLM itself.
  • Environment: The context window, the space in which text can be entered via a prompt.
  • State: The text currently in the context window.
  • Action: Generating text.
  • Reward: How closely the model's output aligns with human preferences.
  • Rollout: The sequence of states and actions during fine-tuning.

Reward Model

A model that scores how well a completion aligns with human preferences, used in place of human labelers during RLHF training.

How to do RLHF

  • Have human labelers rank completions on helpfulness, harmlessness, etc.
  • Convert the rankings to pairwise training data.
    • For example, in each pair the less helpful completion is labeled 0 (no reward) and the more helpful one is labeled 1 (rewarded).
  • Train the reward model to return a score for how well a completion aligns with the preferences (a sketch of the pairwise objective follows this list).
    • The logits produced before the final probability output can be used directly as the score.
  • Iteratively fine-tune the LLM on the prompt dataset, updating its weights based on the rewards.
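
A minimal sketch of the pairwise reward-model objective in PyTorch, assuming the reward model emits one scalar score (the raw logit) per completion; the pairwise_reward_loss helper below is illustrative, not from the source.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_chosen: torch.Tensor,
                         score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: prefer the completion labeled 1
    over the one labeled 0."""
    # -log sigmoid(r_chosen - r_rejected) is minimized when the chosen
    # completion's score is well above the rejected completion's score.
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: scores are the raw logits produced by the reward model head.
chosen = torch.tensor([1.2, 0.3])    # scores of preferred completions
rejected = torch.tensor([0.1, 0.5])  # scores of rejected completions
print(pairwise_reward_loss(chosen, rejected).item())
```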

Reward hacking

Avoid reward hacking

  • Problem: The model becomes biased toward the reward model and produces completions that score well but are not relevant. Similar to overfitting.
  • Solution: Add a regularizer: a frozen reference model, plus a penalty term on the difference between the updated model and the reference, for example the KL divergence (a sketch follows the formula below).

KL divergence:

D_{KL}(P \| Q) = \sum_x P(x) \log\frac{P(x)}{Q(x)}
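
A small sketch of how the KL penalty could be applied, assuming per-token probability distributions from the updated policy (P) and the frozen reference model (Q); the penalty coefficient and reward value are placeholders.

```python
import torch

def kl_divergence(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), summed over the vocabulary."""
    return (p * (p / q).log()).sum(dim=-1)

# Toy next-token distributions over a 4-token vocabulary.
policy_probs = torch.softmax(torch.tensor([2.0, 1.0, 0.5, 0.1]), dim=-1)     # updated LLM (P)
reference_probs = torch.softmax(torch.tensor([1.8, 1.1, 0.6, 0.2]), dim=-1)  # frozen reference (Q)

reward_score = torch.tensor(0.8)  # score from the reward model (placeholder)
kl_weight = 0.1                   # assumed penalty coefficient
# Penalize the reward when the fine-tuned model drifts away from the reference.
penalized_reward = reward_score - kl_weight * kl_divergence(policy_probs, reference_probs)
print(penalized_reward.item())
```
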
Proximal Policy Optimization (PPO)

Phase 1: Calculate the loss of the value function

L^{VF} = \frac{1}{2}\left\|V_{\theta}(s) - \left(\sum_{t=0}^{T}\gamma^t r_t \;\middle|\; s_0 = s\right)\right\|_2^2

  • V_{\theta}(s): Value function, the estimated future rewards from state s.
  • \sum_{t=0}^{T}\gamma^t r_t \mid s_0 = s: The known (observed) discounted future total reward. A code sketch follows below.
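
A sketch of the value-function loss under the formula above: the squared error between the value head's estimate V_theta(s) and the discounted sum of observed rewards (function names are illustrative).

```python
import torch

def discounted_return(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """sum_{t=0}^{T} gamma^t * r_t: the known total future reward from s_0 = s."""
    discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
    return (discounts * rewards).sum()

def value_loss(v_estimate: torch.Tensor, rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """L^VF = 0.5 * || V_theta(s) - discounted return ||^2."""
    return 0.5 * (v_estimate - discounted_return(rewards, gamma)) ** 2

# Toy example: per-step rewards for one rollout and the value head's estimate.
rewards = torch.tensor([0.0, 0.0, 1.0])
v_estimate = torch.tensor(0.7)
print(value_loss(v_estimate, rewards).item())
```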

Phase 2: Calculate the loss of the policy function

L^{Policy} = \min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\cdot\hat{A}_t,\; \mathrm{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon\right)\cdot\hat{A}_t\right)

\pi_\theta: The model's probability distribution over tokens.

  • \pi_\theta(a_t|s_t): Probability of the next token under the updated LLM.
  • \pi_{\theta_{old}}(a_t|s_t): Probability of the next token under the initial LLM.
  • \hat{A}_t: Advantage term, the estimated benefit of the chosen token relative to alternatives.
  • \mathrm{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},\, 1-\epsilon,\, 1+\epsilon\right): Trust region that keeps the updated policy's output close to the original. A code sketch follows below.
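
A sketch of the clipped policy objective, working from per-token log-probabilities of the old and updated policies and an advantage estimate (names are illustrative; in practice gradient descent runs on the negated objective).

```python
import torch

def ppo_policy_objective(logp_new: torch.Tensor,
                         logp_old: torch.Tensor,
                         advantage: torch.Tensor,
                         epsilon: float = 0.2) -> torch.Tensor:
    """L^Policy = min(ratio * A_hat, clip(ratio, 1 - eps, 1 + eps) * A_hat)."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    # Clipping keeps the update inside the trust region around the old policy.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return torch.minimum(unclipped, clipped).mean()

# Toy per-token log-probabilities and advantage estimates.
logp_new = torch.tensor([-1.0, -0.5])
logp_old = torch.tensor([-1.1, -0.7])
advantage = torch.tensor([0.8, -0.3])
print(ppo_policy_objective(logp_new, logp_old, advantage).item())
```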

Phase 3: Calculate the entropy loss

L^{ENT} = \mathrm{entropy}(\pi_\theta(\cdot \mid s_t))

Combined: L^{PPO} = L^{VF} + c_1 L^{Policy} + c_2 L^{ENT}, where c_1 and c_2 are weighting hyperparameters. A code sketch follows below.
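
A sketch of the entropy term and the combined objective as written above; the weights c1 and c2 are placeholder values, not prescribed ones.

```python
import torch

def entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    """L^ENT = entropy of the next-token distribution pi_theta(. | s_t).
    Higher entropy discourages the model from collapsing onto a few tokens."""
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Combined objective as written above: L^PPO = L^VF + c1 * L^Policy + c2 * L^ENT.
c1, c2 = 1.0, 0.01                              # placeholder weights
logits = torch.tensor([[2.0, 1.0, 0.1, -0.5]])  # next-token logits from the policy
l_vf = torch.tensor(0.05)                       # value loss, as in the Phase 1 sketch
l_policy = torch.tensor(0.12)                   # policy loss, as in the Phase 2 sketch
l_ppo = l_vf + c1 * l_policy + c2 * entropy_bonus(logits)
print(l_ppo.item())
```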

Constitutional AI

  • A constitution is a set of prompts describing the principles the model has to follow.
  • Red teaming: Humans construct prompts that elicit harmful or unwanted responses.

Reinforcement Learning from AI Feedback

RLAIF: Instead of human labelers, an AI model, guided by the constitution, ranks completions and provides the preference feedback used to train the reward model.

LLM Powered Applications

LLM Lifecycle

  • RAG: Retrieval-augmented generation. Grounding the model on external information. Bard does something like this.
  • Chain-of-thought prompting: Asks the model to show its work. Helps the model deal with more complex math problems.
  • Program-aided language (PAL) models: Have the LLM generate completions where reasoning steps are accompanied by computer code (a sketch follows this list).
  • ReAct: Combining reasoning and action. Shows an LLM, through structured examples, how to reason through a problem and decide on actions to take.
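
A minimal sketch of the PAL idea: the model's completion contains executable reasoning steps, and an external Python interpreter, not the LLM, computes the final answer. The llm_generate function here is a hypothetical placeholder returning a canned completion, not a real API.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned PAL-style completion."""
    return (
        "# Roger has 5 tennis balls. He buys 2 cans of 3 balls each.\n"
        "initial_balls = 5\n"
        "bought_balls = 2 * 3\n"
        "answer = initial_balls + bought_balls\n"
    )

prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
completion = llm_generate(prompt)

# The reasoning steps are code, so the interpreter does the arithmetic.
namespace: dict = {}
exec(completion, namespace)
print(namespace["answer"])  # 11
```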
