How can we build better rewards, i.e., proxy models of human preferences?
Currently, the prevailing form of human preference data collection is as follows: humans are presented with two or more model outputs and asked to select one or rank them. This signal is then used to train a reward model that computes a single scalar reward for each LM-generated sequence, and the LM is trained with RL to maximize the reward it receives from that model. Such a reward provides a relatively sparse training signal, especially for tasks that require generating long-form text, which makes RLHF in these domains unreliable. This project focuses on what it would take to move toward more natural (i.e., language-based), fine-grained rewards along multiple dimensions.
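To make the contrast concrete, here is a minimal sketch in PyTorch; all function names and the toy numbers are illustrative assumptions, not released code from the paper referenced below. It shows (1) the Bradley-Terry-style pairwise loss commonly used to train a holistic reward model from preference comparisons, and (2) a fine-grained alternative that scores each segment (e.g., each sentence) along several feedback dimensions and mixes them into a dense per-segment reward.

```python
# Sketch only: contrasts a holistic scalar reward with a fine-grained one.
# All names and values here are hypothetical, for illustration.
import torch
import torch.nn.functional as F


def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective for a holistic reward model: push the
    preferred sequence's single scalar reward above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


def fine_grained_reward(segment_scores: torch.Tensor,
                        weights: torch.Tensor) -> torch.Tensor:
    """Dense reward for one generation.

    segment_scores: (num_segments, num_dimensions) scores, e.g. one row per
        sentence, one column per feedback type (relevance, factuality, ...).
    weights: (num_dimensions,) mixing weights over the feedback types.

    Returns a (num_segments,) vector: the RL learner receives a signal at
    every segment boundary instead of one scalar at the end of the sequence.
    """
    return segment_scores @ weights


# Toy usage (made-up numbers): a 3-sentence generation scored on 2 dimensions.
scores = torch.tensor([[0.9, 0.2],
                       [0.1, 0.8],
                       [0.7, 0.7]])
w = torch.tensor([0.5, 0.5])
print(fine_grained_reward(scores, w))   # per-sentence rewards, not one scalar

# Holistic baseline: one scalar per sequence, compared pairwise.
print(preference_loss(torch.tensor([1.2]), torch.tensor([0.3])))
```

The design point is the shape of the signal: the holistic loss yields exactly one number per generation, while the fine-grained reward gives the policy credit or blame at each segment, which is what makes it a denser training signal for long-form text.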
References
2023
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi
In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023
@inproceedings{wu2023finegrained,
  title     = {Fine-Grained Human Feedback Gives Better Rewards for Language Model Training},
  author    = {Wu, Zeqiu and Hu, Yushi and Shi, Weijia and Dziri, Nouha and Suhr, Alane and Ammanabrolu, Prithviraj and Smith, Noah A. and Ostendorf, Mari and Hajishirzi, Hannaneh},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
  year      = {2023},
  url       = {https://arxiv.org/abs/2306.01693},
}
2022
INSCIT: Information-Seeking Conversations with Mixed-Initiative Interactions
Zeqiu Wu, Ryu Parish, Hao Cheng, Sewon Min, Prithviraj Ammanabrolu, Mari Ostendorf, and Hannaneh Hajishirzi
Transactions of the Association for Computational Linguistics (TACL), 2022
@article{wu2022inscit,
  title   = {INSCIT: Information-Seeking Conversations with Mixed-Initiative Interactions},
  author  = {Wu, Zeqiu and Parish, Ryu and Cheng, Hao and Min, Sewon and Ammanabrolu, Prithviraj and Ostendorf, Mari and Hajishirzi, Hannaneh},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  year    = {2022},
  url     = {https://arxiv.org/abs/2207.00746},
}