Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their reliable adoption requires that they genuinely understand program execution rather than rely on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated to achieve a specific behavioral objective. Together, the two tasks probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
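To make the two tasks concrete, the following is a minimal illustrative sketch in Python. It is not taken from DexBench: the toy program `classify`, the helper names `check_prediction` and `check_mutation`, and the scoring logic are all assumptions introduced here for illustration only.

```python
def classify(n: int) -> str:
    """Hypothetical toy program under test (not a DexBench instance)."""
    if n < 0:
        return "negative"
    if n % 2 == 0:
        return "even"
    return "odd"


def check_prediction(program, given_input, predicted_behavior) -> bool:
    """Task (i): the model predicts the behavior observed for a given input.

    The prediction counts as correct if it matches what the program
    actually returns when executed on that input.
    """
    return program(given_input) == predicted_behavior


def check_mutation(program, original_input, mutated_input, target_behavior) -> bool:
    """Task (ii): the model proposes a mutated input that reaches a target behavior.

    The proposal counts as correct if the input was actually changed and
    executing the program on it produces the target behavior.
    """
    return mutated_input != original_input and program(mutated_input) == target_behavior


# Task (i): for input 4, a model answering "even" is correct.
forward_ok = check_prediction(classify, 4, "even")

# Task (ii): starting from input 4, the objective is the "odd" branch.
# Proposing 5 reaches it; proposing 6 does not.
inverse_ok = check_mutation(classify, 4, 5, "odd")
inverse_bad = check_mutation(classify, 4, 6, "odd")
```

In this sketch, both checks are validated by executing the program, so a paired instance is one program and input evaluated in both directions under the assumptions above.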