LLMs demonstrate strong performance on code benchmarks, yet consistent reasoning across forward and backward execution remains elusive. We present RoundTripCodeEval (RTCE), a benchmark of four code execution reasoning tasks that evaluates round-trip consistency through execution-free, exact-match assessment of bijection fidelity across four lossless compression algorithms. We evaluate state-of-the-art Code-LLMs under zero-shot prompting, supervised fine-tuning (SFT) on execution traces, and iterative self-reflection. All three approaches yield only modest improvements and none closes the gap, revealing that current LLMs lack the internal coherence required for reliable bidirectional code reasoning. RTCE surfaces three findings invisible to existing benchmarks: models frequently pass the individual forward and backward tasks yet fail the combined round trip, exposing mutually inconsistent internal representations; SFT and self-reflection saturate after one revision round, indicating that they cannot repair fundamental algorithmic misunderstandings; and failures persist even on simple bijections such as run-length encoding (RLE), suggesting that algorithmic complexity is not the sole root cause.\footnote{Code and dataset are available at https://github.com/Nickil21/round-trip-code-compression.}
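To make the round-trip notion concrete, the following is a minimal sketch for RLE only. It is not the RTCE evaluation code: the function names, the list-of-pairs encoding format, and the three-way scoring (forward, backward, round trip) are assumptions made here for illustration. The sketch assumes the model's predictions are already available as plain Python values and compares them by exact match against a reference encoder.

```python
from itertools import groupby


def rle_encode(text):
    """Reference run-length encoder: 'aaab' -> [('a', 3), ('b', 1)]."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Inverse of rle_encode: expands each (char, count) pair."""
    return "".join(ch * count for ch, count in pairs)


def score_example(original, pred_forward, pred_backward, pred_round_trip):
    """Exact-match scoring for one example (hypothetical interface).

    pred_forward:    model's predicted encoding of `original`
    pred_backward:   model's predicted decoding of the reference encoding
    pred_round_trip: model's predicted decoding of its own pred_forward
    """
    reference = rle_encode(original)
    return {
        "forward": pred_forward == reference,
        "backward": pred_backward == original,
        "round_trip": pred_round_trip == original,
    }


# A model can be right on each direction in isolation and still
# fail to recover the input when the two steps are chained.
scores = score_example(
    original="aaabcc",
    pred_forward=[("a", 3), ("b", 1), ("c", 2)],
    pred_backward="aaabcc",
    pred_round_trip="aaabc",
)
```

Under these assumptions, the example scores as a forward pass and a backward pass but a round-trip failure, which is the pattern of inconsistency the abstract describes.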