Training large language models (LLMs) with open-domain instruction data has yielded remarkable success in aligning them with end tasks and human preferences. Extensive research has highlighted the importance of the quality and diversity of instruction data. However, the impact of data complexity, a crucial metric, remains relatively unexplored in three respects: (1) whether performance improvements are sustained as complexity increases; (2) whether the improvement brought by complexity merely comes from introducing more training tokens; and (3) whether ordering instructions from easy to difficult brings any benefit. In this paper, we propose Tree-Instruct to systematically enhance instruction complexity in a controllable manner. By adding a specified number of nodes to an instruction's semantic tree, this approach not only yields new instruction data from the modified tree but also allows us to control the difficulty level of the modified instructions. Our preliminary experiments reveal the following insights: (1) Increasing complexity consistently leads to sustained performance improvements of LLMs. (2) Under the same token budget, a few complex instructions outperform diverse yet simple instructions. (3) Curriculum instruction tuning might not yield the anticipated results; focusing on increasing complexity appears to be the key.
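To make the idea of controlling complexity by node count concrete, the following is a minimal sketch on a toy tree. It is not the Tree-Instruct implementation: the `Node` class, the `add_nodes` helper, the example instruction, and the added labels are all hypothetical, and the abstract does not specify how the semantic tree is built or how new nodes are chosen. The sketch only illustrates that the number of added nodes acts as an explicit complexity knob.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a toy semantic tree (hypothetical representation)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def iter_nodes(root):
    """Yield every node in the tree, depth-first."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def count_nodes(root):
    """Return the total number of nodes in the tree."""
    return sum(1 for _ in iter_nodes(root))


def add_nodes(root, new_labels, seed=0):
    """Attach each new label as a child of a randomly chosen existing node.

    The number of added nodes, len(new_labels), is the complexity knob.
    Random placement is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    for label in new_labels:
        parent = rng.choice(list(iter_nodes(root)))
        parent.children.append(Node(label))
    return root


# Toy tree for the instruction "Write a poem about the sea".
tree = Node("write", [Node("poem", [Node("about the sea")])])
before = count_nodes(tree)
add_nodes(tree, ["in iambic pentameter", "under 50 words", "for children"])
after = count_nodes(tree)
print(before, after)  # prints "3 6"
```

Because the caller fixes the number of added nodes, instructions at different difficulty levels can be generated from the same seed instruction by varying the length of `new_labels`.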