LLMs demonstrate strong performance on code benchmarks, yet round-trip code execution reveals limitations in their ability to reason consistently across forward and backward execution. We present RoundTripCodeEval (RTCE), a comprehensive benchmark of four distinct code-execution reasoning tasks designed to rigorously test round-trip consistency. RTCE provides an execution-free, exact-match evaluation of bijection fidelity: whether a model preserves a consistent one-to-one mapping between encoding and decoding operations across algorithms and directions. We systematically evaluate state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning on execution traces, and self-reflection. Each yields modest improvements, but none closes the gap, indicating that current LLMs struggle with true round-trip consistency and lack the internal coherence required for trustworthy code reasoning. RTCE surfaces insights not captured by existing I/O-prediction, execution-reasoning, or round-trip natural-language benchmarks. We will release the code and the dataset upon acceptance.
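To make the notion of exact-match bijection fidelity concrete, the following is a minimal sketch, not taken from the paper: a hypothetical encoder/decoder pair (a toy Caesar shift) and a scorer that checks whether a model's predicted decoding exactly recovers the original input. The names `encode`, `decode`, and `round_trip_consistent` are illustrative assumptions, standing in for a benchmark task and a model's prediction.

```python
def encode(s: str) -> str:
    """Toy reference encoder: Caesar shift of +3 on lowercase letters (illustrative only)."""
    return "".join(chr((ord(c) - 97 + 3) % 26 + 97) if c.islower() else c for c in s)

def decode(s: str) -> str:
    """Toy reference decoder: exact inverse of the +3 shift."""
    return "".join(chr((ord(c) - 97 - 3) % 26 + 97) if c.islower() else c for c in s)

def round_trip_consistent(model_decode, inputs) -> float:
    """Fraction of inputs whose predicted decode exactly matches the original.

    `model_decode` stands in for an LLM's backward-execution prediction;
    no code is executed by the model itself, only compared string-for-string.
    """
    hits = sum(1 for x in inputs if model_decode(encode(x)) == x)
    return hits / len(inputs)

samples = ["hello", "round", "trip"]
# A prediction that is the true inverse achieves perfect round-trip fidelity;
# a prediction that drops the last character fails exact match on every input.
print(round_trip_consistent(decode, samples))                    # → 1.0
print(round_trip_consistent(lambda y: decode(y)[:-1], samples))  # → 0.0
```

Exact string match is a deliberately strict criterion: any drift between a model's forward (encode) and backward (decode) reasoning breaks the one-to-one mapping and scores zero on that instance.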