This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite recent advancements in LLMs, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each extended with padding of different lengths, types, and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that the traditional metric of next-word prediction correlates negatively with LLMs' performance on our reasoning dataset. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
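To make the padding methodology concrete, here is a minimal sketch of how one might construct multiple versions of the same QA sample that differ only in the amount and location of task-irrelevant padding. All names here (`make_padded_versions`, `FILLER_SENTENCES`, the specific pad sizes and locations) are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import random

# Hypothetical pool of task-irrelevant filler text used as padding.
FILLER_SENTENCES = [
    "The committee met again on Tuesday to review the quarterly report.",
    "Rainfall in the region was slightly above the seasonal average.",
    "The museum's new exhibit opened to generally positive reviews.",
]

def make_padding(num_sentences: int) -> str:
    """Sample irrelevant filler text of a requested size."""
    return " ".join(random.choices(FILLER_SENTENCES, k=num_sentences))

def make_padded_versions(relevant_text: str, question: str,
                         pad_sizes=(0, 10, 50, 200)):
    """Build several versions of one sample that share the same
    task-relevant text and question, varying only the length and
    location of the surrounding padding."""
    versions = []
    for size in pad_sizes:
        pad = make_padding(size)
        half = len(pad) // 2
        for location, context in {
            "before": f"{pad}\n{relevant_text}",
            "after": f"{relevant_text}\n{pad}",
            "split": f"{pad[:half]}\n{relevant_text}\n{pad[half:]}",
        }.items():
            versions.append({
                "context": context,
                "question": question,
                "pad_sentences": size,
                "pad_location": location,
            })
    return versions

if __name__ == "__main__":
    sample = make_padded_versions(
        relevant_text="Alice arrived at 9am. Bob arrived two hours later.",
        question="At what time did Bob arrive?",
    )
    for v in sample[:3]:
        print(v["pad_sentences"], v["pad_location"], len(v["context"]))
```

Because every version shares the identical relevant text and question, any difference in model accuracy across versions can be attributed to input length and padding placement rather than to the underlying task.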