Where Do Large Language Models Fail on Competitive Programming? A Taxonomy of Failures by Algorithm Type and Difficulty Rating

Large language models (LLMs) show increasing proficiency on competitive programming benchmarks, yet technical reports mostly publish aggregate pass rates, which obscure domain-specific weaknesses. We present a systematic empirical study of LLM failure patterns using a balanced taxonomy of 315 Codeforces problems spanning seven algorithm categories and three difficulty tiers. We evaluate GPT-4o and Claude Sonnet 4.6 under strict execution-based conditions with the sampling temperature fixed at T = 0.2. To isolate the effect of the reasoning framework on algorithmic correctness, we run an ablation study comparing direct zero-shot generation against zero-shot Chain-of-Thought (CoT) prompting. Our findings diverge sharply from results on standard NLP benchmarks, where CoT is generally reported to help: forcing CoT lowers GPT-4o's pass rate from 46.0% to 36.8% and worsens an existing weakness on Greedy problems. Claude, in contrast, keeps a higher pass rate (63.5% under CoT), but the longer generated output degrades its adherence to the markdown formatting instructions, and its Compile Error count more than triples (from 9 to 31, a 244% increase). Failure-mode analysis further shows that Wrong Answer (WA) is the dominant verdict for both models, accounting for over 90% of GPT-4o's unaccepted solutions and roughly 70% of Claude's. These findings indicate that standard prompt engineering techniques such as zero-shot CoT do not close the algorithmic reasoning gap in competitive programming.
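The arithmetic behind the headline figures can be checked with a short sketch. The helper names below are hypothetical and not part of the study's tooling. The even split of problems across category-tier cells is an assumption: the abstract only states that the taxonomy is balanced.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100


def problems_per_cell(total: int, categories: int, tiers: int) -> int:
    """Problems per (category, tier) cell, ASSUMING an exactly even split."""
    cells = categories * tiers
    if total % cells != 0:
        raise ValueError("total does not divide evenly across cells")
    return total // cells


# Claude's Compile Errors: 9 (direct) vs. 31 (CoT), as reported above.
ce_increase = percent_increase(9, 31)

# GPT-4o pass rate: 46.0% (direct) vs. 36.8% (CoT), as reported above.
drop_points = 46.0 - 36.8                    # absolute drop, percentage points
drop_relative = -percent_increase(46.0, 36.8)  # drop relative to the baseline

# 315 problems over 7 categories x 3 tiers (even split is an assumption).
per_cell = problems_per_cell(315, 7, 3)

print(f"Compile Error increase: {ce_increase:.0f}%")
print(f"GPT-4o drop: {drop_points:.1f} points ({drop_relative:.0f}% relative)")
print(f"Problems per cell: {per_cell}")
```

Under these assumptions, the Compile Error increase rounds to the reported 244%, the GPT-4o drop is 9.2 percentage points (about 20% relative to its direct-generation pass rate), and each category-tier cell would hold 15 problems.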
