Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models such as GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves their information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., a $10.5\%$ improvement on 20-document MDQA at position $10$ for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant, whereas finetuning on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data causes no performance drop, while other baseline data can cause drops ranging from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
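The abstract does not specify the exact format of the synthetic numerical key-value retrieval tasks; as a minimal illustrative sketch (function name and prompt wording are assumptions, not the paper's actual data format), such an example might be constructed as follows:

```python
import random


def make_kv_retrieval_example(num_pairs=20, seed=0):
    """Build one synthetic numerical key-value retrieval example:
    a context of random integer key-value pairs plus a query for one key.
    (Illustrative sketch only; not the paper's exact data format.)"""
    rng = random.Random(seed)
    keys = rng.sample(range(10**6), num_pairs)          # distinct numeric keys
    pairs = {k: rng.randrange(10**6) for k in keys}     # random numeric values
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    query_key = rng.choice(keys)
    prompt = (
        "Below is a list of key-value pairs.\n"
        f"{context}\n"
        f"What is the value associated with key {query_key}?"
    )
    return prompt, str(pairs[query_key])


prompt, answer = make_kv_retrieval_example()
```

Scaling `num_pairs` lengthens the context, so retrieval difficulty can be varied to target different context lengths during finetuning.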