In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount, as more agents and applications are built on LLMs and the complexity of their instructions is rapidly increasing. However, on the one hand, evaluation data for complex instruction following remains scarce; on the other hand, there are no dedicated algorithms for improving the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training examples and 1K evaluation examples. Furthermore, we propose IOPO (Input-Output Preference Optimization), an alignment method that takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data over SFT and DPO, respectively.
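The abstract only sketches IOPO at a high level. As a rough illustration of what "taking both input and output preference pairs into consideration" could mean, the snippet below combines a standard DPO-style output-preference term (fix the instruction, contrast responses) with an analogous input-preference term (fix the response, contrast instructions). All function names, the pairing scheme, and the log-sigmoid weighting are illustrative assumptions for exposition, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO-style loss on the log-ratio margin between a preferred and a
    # dispreferred sequence, measured against a frozen reference policy.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin)

def iopo_style_loss(logp, ref_logp, beta=0.1):
    # Hypothetical sketch: `logp[(x, y)]` is the policy log-probability of
    # response y given instruction x, for paired instructions {x1, x2} and
    # paired responses {y1, y2}; `ref_logp` holds the reference-model values.
    # Output preference: under instruction x1, response y1 is preferred over y2.
    out_term = dpo_term(logp[("x1", "y1")], logp[("x1", "y2")],
                        ref_logp[("x1", "y1")], ref_logp[("x1", "y2")], beta)
    # Input preference: response y1 should be more probable under its matching
    # instruction x1 than under the perturbed instruction x2.
    in_term = dpo_term(logp[("x1", "y1")], logp[("x2", "y1")],
                       ref_logp[("x1", "y1")], ref_logp[("x2", "y1")], beta)
    return (out_term + in_term).mean()
```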