Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovation in query formats and learning algorithms.
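To make the IIA assumption concrete, the sketch below uses a multinomial logit (Luce) choice model as an illustrative stand-in; the text above does not name the specific models involved, so this choice is an assumption. The function name `luce_choice_probs`, the alternative labels, and the utility values are all hypothetical. Under this model, the ratio of choice probabilities between two alternatives does not change when a third alternative is added, even if that third alternative is a near-duplicate of an existing one.

```python
import math


def luce_choice_probs(utilities):
    """Return softmax choice probabilities for a dict of alternative -> utility.

    Illustrative multinomial logit (Luce) model; utilities are made up.
    """
    # Subtract the max utility for numerical stability before exponentiating.
    max_u = max(utilities.values())
    weights = {name: math.exp(u - max_u) for name, u in utilities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Choice set with two alternatives.
two = luce_choice_probs({"A": 1.0, "B": 0.0})

# Same choice set plus a hypothetical duplicate of B with identical utility.
three = luce_choice_probs({"A": 1.0, "B": 0.0, "B_copy": 0.0})

# IIA: the odds of A versus B are the same in both choice sets.
ratio_two = two["A"] / two["B"]
ratio_three = three["A"] / three["B"]

print(f"P(A)/P(B) with two alternatives:   {ratio_two:.4f}")
print(f"P(A)/P(B) with three alternatives: {ratio_three:.4f}")
print(f"P(A) with two alternatives:   {two['A']:.4f}")
print(f"P(A) with three alternatives: {three['A']:.4f}")
```

In this sketch the odds of A over B stay fixed at exp(1.0 - 0.0), while the probability of choosing A falls once the duplicate is added. A model with this property treats a redundant alternative as if it were a genuinely new option, which is one simple way to see how IIA can pull predicted choices away from actual preferences.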