Aligning large language models to handle instructions with extremely long contexts has yet to be fully investigated. Previous studies attempt to scale up the available data volume by synthesizing long instruction-following samples, since constructing such datasets manually is challenging for annotators. However, the lack of a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict model performance. Thus, we propose GATEAU, a novel framework that addresses the unique challenge of long context alignment by identifying influential samples enriched with long-range dependency relations. Specifically, GATEAU measures long-range dependencies from two essential aspects: the difficulty of generating target responses due to long-range dependencies, and the difficulty of understanding long inputs due to such dependencies. Comprehensive experiments indicate that GATEAU effectively identifies influential samples, and that models trained on these selected samples exhibit better instruction-following and long-context understanding capabilities.
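The selection idea described above can be sketched as follows: each candidate sample receives two difficulty scores, one for generating the target response given the long context and one for understanding the long input, and the samples with the strongest combined signal are kept for training. This is a minimal illustrative sketch; the function names, the equal-weight sum, and the min-max normalization are assumptions for exposition, not the paper's actual formulation.

```python
def select_influential(samples, response_difficulty, input_difficulty, keep_ratio=0.5):
    """Rank samples by two (hypothetical) long-range-dependency scores
    and keep the top fraction for alignment training.

    samples: list of sample ids; the two score lists are parallel to it.
    """
    def normalize(xs):
        # Min-max normalize so the two score scales are comparable
        # (an illustrative choice, not necessarily the paper's).
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    r = normalize(response_difficulty)   # response-generation difficulty
    c = normalize(input_difficulty)      # long-input understanding difficulty
    combined = [ri + ci for ri, ci in zip(r, c)]

    ranked = sorted(range(len(samples)), key=lambda i: combined[i], reverse=True)
    k = max(1, int(len(samples) * keep_ratio))
    return [samples[i] for i in ranked[:k]]
```

In practice the two scores would come from model-based measurements over long contexts; the sketch only shows how such scores could be fused into a single ranking for data selection.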