Aligned instruction-following models can better fulfill user requests than their unaligned counterparts. However, it has been shown that the evaluation of such models is subject to a length bias, and that training algorithms tend to exploit this bias by learning to produce longer responses. In this work we show how to train models that can be controlled at inference time with instructions containing desired length constraints. Such models are superior in length-instructed evaluations, outperforming standard instruction-following models such as GPT4, Llama 3 and Mixtral.
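To make the idea of inference-time length control concrete, the following is a minimal Python sketch, not the paper's training recipe: the instruction template and the word-based length measure are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's exact method): augment a prompt
# with an explicit length constraint, then check whether a model's response
# satisfies it. Words are used as the length unit purely for illustration.

def add_length_instruction(prompt: str, max_words: int) -> str:
    """Prepend a length constraint so generation can be steered at inference."""
    return f"Answer the following in at most {max_words} words.\n{prompt}"


def violates_length(response: str, max_words: int) -> bool:
    """Return True if the response exceeds the requested word budget."""
    return len(response.split()) > max_words


if __name__ == "__main__":
    constrained = add_length_instruction("Explain how transformers work.", 50)
    print(constrained)
    # A length-instructed model should keep its answer within the budget:
    print(violates_length("A transformer relies on self-attention.", 50))
```

In a length-instructed evaluation, a violation check like `violates_length` can be combined with response quality to penalize models that ignore the stated constraint.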