With the dramatic expansion of context windows in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a substantial gap between this theoretical capacity and models' practical ability to robustly use information in long contexts, especially on tasks that require a comprehensive understanding of many details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that once the input on the social media dataset exceeds 5K posts (roughly 70K tokens), the performance of all models degrades sharply, with accuracy dropping to around 50-53% at 20K posts. Notably, although GPT-5's accuracy declined sharply, its precision remained high at approximately 95%, a property that could be highly valuable for sensitive applications such as depression detection. This research also indicates that the "lost in the middle" problem has been largely resolved in newer models. This study emphasizes the gap between the theoretical capacity and the actual performance of models on complex, high-volume data tasks and highlights the importance of metrics beyond simple accuracy for practical applications.
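The divergence between accuracy and precision reported for GPT-5 can be made concrete with a small sketch. The confusion-matrix counts below are hypothetical, chosen only to reproduce the qualitative pattern (they are not taken from the paper's results): a model that misses many true positives but rarely raises a false alarm keeps high precision even as its accuracy collapses.

```python
# Hypothetical illustration: high precision despite low accuracy.
# A model that misses many depressed users (false negatives) but almost
# never flags a non-depressed user (few false positives).

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

# 1000 posts, balanced classes: 500 positive, 500 negative.
tp, fn = 60, 440   # only a fraction of true positives are caught
fp, tn = 3, 497    # but false alarms are rare

print(f"accuracy:  {accuracy(tp, fp, fn, tn):.2f}")   # ~0.56
print(f"precision: {precision(tp, fp):.2f}")          # ~0.95
```

For a screening application like depression detection, this asymmetry means that the positives the model does surface can still be trusted, even when overall accuracy is near chance.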