The prevailing form of human preference data collection works as follows: annotators are shown two or more model outputs and asked to select one or to rank them. This signal is then used to train a reward model, which assigns a single scalar reward to each LM-generated sequence, and the LM is trained with RL to maximize the reward it receives from that model. A single scalar per sequence is a relatively sparse training signal, especially for tasks that require generating long-form text, which makes RLHF unreliable in such domains. This project focuses on what it would take to move toward more natural (language-based), fine-grained rewards along various dimensions.
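To make the standard pipeline concrete, here is a minimal sketch of the sequence-level reward-model objective, a Bradley-Terry style pairwise loss, in PyTorch. The encoder, dimensions, and toy batch are illustrative placeholders, not the setup used in the work below; the point is that the model emits exactly one scalar per sequence, which is precisely the sparsity described above.

```python
# Minimal sketch, assuming a pairwise preference dataset: given a preferred
# ("chosen") and a dispreferred ("rejected") response, a Bradley-Terry style
# loss pushes the scalar reward of the chosen response above the rejected one.
# RewardModel, ToyEncoder, and the random batch are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                      # any LM-style backbone
        self.scorer = nn.Linear(hidden_size, 1)     # pooled state -> scalar

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_ids)            # (batch, seq_len, hidden)
        pooled = hidden[:, -1, :]                   # last token's state
        return self.scorer(pooled).squeeze(-1)      # ONE scalar per sequence

def pairwise_loss(model: RewardModel,
                  chosen_ids: torch.Tensor,
                  rejected_ids: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(model(chosen_ids) - model(rejected_ids)).mean()

# Toy encoder so the sketch runs end to end: embedding + GRU over token ids.
vocab, hidden = 1000, 64
class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
    def forward(self, ids):
        out, _ = self.rnn(self.emb(ids))
        return out

model = RewardModel(ToyEncoder(), hidden)
chosen = torch.randint(0, vocab, (4, 16))     # 4 preferred sequences
rejected = torch.randint(0, vocab, (4, 16))   # 4 dispreferred sequences
loss = pairwise_loss(model, chosen, rejected)
loss.backward()                               # trains encoder and scorer
```

Note that no matter how long the sequence is, the RL stage sees only the single number this model produces at the end; fine-grained feedback instead attaches rewards to spans or individual error types.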
- Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.
- INSCIT: Information-Seeking Conversations with Mixed-Initiative Interactions. Transactions of the Association for Computational Linguistics (TACL), 2022.