Reinforcement Learning from Human Feedback
How can we scale RL to combinatorially large language action spaces and messy human-preference rewards?
The ultimate aim of language technology is to interact with humans. However, most such systems are trained without direct signals of human preference, with supervised target strings serving as a (sometimes crude) proxy. This work focuses on using reinforcement learning to interact with humans and align models to their preferences.
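To make the preference signal concrete, the standard RLHF recipe first distills pairwise human comparisons into a scalar reward model via a Bradley-Terry objective, which is then optimized with RL. Below is a minimal sketch of that reward-modeling loss, assuming a `reward_model` that maps token ids to one scalar score per sequence; the names here are illustrative, not from this work:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss: train the reward model to score
    the human-preferred completion above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The learned reward then stands in for direct human judgment during policy optimization, which is where the combinatorial action space of free-form text makes scaling RL difficult.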