Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet there is ongoing debate about these abilities and, more recently, about potential data contamination. This paper evaluates the reasoning capacities of LLMs on recent competition-level programming problems from Codeforces, which are expert-crafted and unique and require deep understanding and robust reasoning skills. We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering aspects such as the problems' release time, difficulty, and the types of errors encountered. Surprisingly, the perceived performance of GPT-4 exhibits a cliff-like decline on problems released after September 2021, consistently across all difficulty levels and problem types, which suggests potential data contamination as well as the challenge any existing LLM faces in solving unseen complex reasoning problems. We further explore approaches such as fine-tuning, Chain-of-Thought prompting, and problem description simplification; unfortunately, none of them consistently mitigates these challenges. Through our work, we emphasize the importance of this excellent data source for assessing the genuine reasoning capabilities of LLMs, and we aim to foster the development of LLMs with stronger reasoning abilities and better generalization in the future.