Module 3: Reinforcement Learning from Human Feedback


About this title

This episode addresses how Reinforcement Learning from Human Feedback (RLHF) adds the final layer of alignment after supervised fine-tuning, shifting the training signal from judging "right vs. wrong" to ranking "better vs. worse." We explore how preference rankings create a reward signal (a reward model trained on human rankings, then optimized with PPO) and the newer shortcut, Direct Preference Optimization (DPO), which learns from preferences directly without a separate reward model. We then connect RLHF to safety through the Helpful, Honest, Harmless goal and unpack the "alignment tax": the trade-off between being safe and being genuinely useful. We close by setting up the next module on running models at scale, starting with GPU memory limits, plus a personal reflection on starting later without being behind.
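The DPO shortcut mentioned above can be illustrated with its per-pair loss. This is a minimal sketch, not the episode's own code: the function name and the assumption that inputs arrive as log-probabilities of whole responses are illustrative choices.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one human preference pair.

    Inputs are log-probabilities of the chosen and rejected responses
    under the policy being trained (pi_*) and a frozen reference
    model (ref_*); beta controls how far the policy may drift.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response over the rejected one, relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid of the margin: the loss shrinks as the
    # policy learns to rank the chosen response higher.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

No reward model and no PPO rollout appears here, which is the point of the shortcut: the preference ranking itself supplies the training signal.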
