Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet there is ongoing debate about these abilities and about potential data contamination. This paper evaluates the reasoning capacities of LLMs, specifically in solving recent competition-level programming problems from Codeforces, which are expert-crafted and unique and require deep understanding and robust reasoning skills. We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering aspects such as problem release time, difficulty, and the types of errors encountered. Surprisingly, the perceived performance of GPT-4 shows a cliff-like decline on problems released after September 2021, consistently across all difficulty levels and problem types, which points to potential data contamination as well as the challenge any existing LLM faces in solving unseen complex reasoning problems. We further explore several approaches, including fine-tuning, Chain-of-Thought prompting, and problem description simplification; unfortunately, none of them consistently mitigates these challenges. Through our work, we emphasize the importance of this excellent data source for assessing the genuine reasoning capabilities of LLMs, and we hope to foster the development of LLMs with stronger reasoning abilities and better generalization in the future.