
Reinforcement Learning with Human Feedback

A gentle introduction

Valentina Alto
6 min read · Dec 10, 2023


Large Language Models (LLMs) have demonstrated outstanding capabilities in their conversational interactions with humans.

In fact, the way we typically consume LLMs is via AI assistants such as ChatGPT. The reason ChatGPT and similar AI assistants were so disruptive (ChatGPT reached 1M users in just 5 days!) is that they are aligned with human preferences, making them extremely good at interacting with users, understanding their intent, and solving their problems.

Before getting to this level of alignment, though, LLMs go through a series of steps (you can learn more from Andrej Karpathy’s talk here):

  • Pre-training → LLMs are trained in an unsupervised way on a huge training dataset. The output of this phase is the so-called base model, which is typically a completion model that predicts the next token given the input tokens received by the user (the first sketch after this list illustrates this).
  • Supervised fine-tuning (SFT) → the base model is trained in a supervised way on a dataset made of (prompt, ideal response) pairs. The output of this phase is called the SFT model.
  • Reward Modeling (with human preferences) → this step consists of training another language model (smaller than the original one) so that it is able to evaluate the SFT model’s output with a…
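
To make the first two steps more concrete, here is a minimal sketch using the Hugging Face transformers library. The model name (gpt2), the example texts, and the prompt/response template are assumptions for illustration only; they are not the actual models or data formats used to build assistants like ChatGPT. The snippet shows a base model doing plain next-token completion, and the kind of (prompt, ideal response) pair an SFT dataset is made of.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small completion-only model; "gpt2" is just a stand-in base model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# A base model simply continues the text: it predicts the next token
# given the tokens it has received so far.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = base_model.generate(
    **inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Supervised fine-tuning (SFT) instead uses (prompt, ideal response) pairs.
# A hypothetical training example, flattened into one training sequence:
sft_example = {
    "prompt": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}
training_text = (
    f"### Prompt:\n{sft_example['prompt']}\n"
    f"### Response:\n{sft_example['response']}"
)
```

The reward-modeling step described in the (truncated) item above can be sketched in a similar spirit. The snippet below is an illustrative assumption, not the article’s exact recipe: a small model with a single scalar output head (distilbert-base-uncased is a placeholder choice) is trained with a pairwise preference loss so that the human-preferred (“chosen”) response scores higher than the “rejected” one.

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A (usually smaller) language model with one scalar output acts as the reward model.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1  # one scalar "reward" per input
)

# One preference example: the same prompt with a human-preferred ("chosen")
# and a less-preferred ("rejected") response.
prompt = "Explain gravity to a child."
chosen = prompt + " Gravity is the force that pulls things toward the ground."
rejected = prompt + " Gravity."

chosen_score = reward_model(**tokenizer(chosen, return_tensors="pt")).logits
rejected_score = reward_model(**tokenizer(rejected, return_tensors="pt")).logits

# Pairwise preference loss: push the chosen score above the rejected one.
loss = -F.logsigmoid(chosen_score - rejected_score).mean()
loss.backward()  # in a real training loop, an optimizer step would follow
```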
