How do we build better rewards, i.e., proxy models of human preferences?
Currently, the prevailing form of human preference data collection is as follows: humans are presented with two or more model outputs and asked to select one or rank them. This signal is then used to train a reward model, which computes a single scalar reward for each LM-generated sequence. The LM is then trained with RL to maximize the reward it receives from the reward model. Such a reward provides a relatively sparse training signal, especially for tasks that require generating long-form text, which makes RLHF unreliable in those domains. This project focuses on what it would take to move toward more natural (e.g., natural-language) and fine-grained rewards along various dimensions.
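To make the standard setup concrete, below is a minimal sketch of the pairwise (Bradley-Terry style) reward-modelling step described above. It is not the implementation from the referenced work: the `RewardModel` class is a hypothetical toy stand-in for an LM-based scorer, and the key point is that it produces a single scalar for an entire sequence, which is precisely the sparsity this project aims to move beyond.

```python
# Minimal sketch of scalar reward modelling from pairwise human preferences.
# Assumes PyTorch; RewardModel is a toy stand-in, not a real LM-based scorer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy reward model: embeds tokens, mean-pools, and projects to
    one scalar score per sequence (the 'sparse' sequence-level reward)."""

    def __init__(self, vocab_size: int = 32000, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch,) scalar rewards
        pooled = self.embed(token_ids).mean(dim=1)
        return self.score(pooled).squeeze(-1)


def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the human-preferred
    sequence above the reward of the dispreferred one."""
    r_chosen = model(chosen)
    r_rejected = model(rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    model = RewardModel()
    chosen = torch.randint(0, 32000, (4, 64))    # preferred outputs
    rejected = torch.randint(0, 32000, (4, 64))  # dispreferred outputs
    loss = preference_loss(model, chosen, rejected)
    loss.backward()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

A fine-grained variant would instead score spans or sentences within a sequence, along separate dimensions (e.g., factuality, relevance), so the RL step receives a denser, more informative signal than one number per generation.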
References
2023
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi
In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023