Coding agents are increasingly used in test-driven software development, yet the theoretical mechanisms behind their environment-interaction strategies remain underexplored. We provide a probabilistic framework for two dominant paradigms: selecting code after generation using the execution environment, and generating code conditioned on environment feedback. First, we formalize several well-established selection heuristics as environment-aware estimators of code correctness. We prove that estimators based on fuzzy functional similarity add an inductive bias and strictly dominate estimators based on functional equivalence in terms of signal-to-noise ratio. Second, we frame backprompting as an in-context approximation of Thompson sampling. We derive a novel regret bound for reward functions with unobservable components, explaining theoretically why the effectiveness of backprompting is limited by the ambiguity of the informal task description (an irreducible regret). Using three state-of-the-art open-weight models, we corroborate these findings on BigCodeBenchHard, LeetCodeDataset, and QiskitHumanEvalSim. Our formalization also suggests how to improve task descriptions effectively, leading to a new benchmark, QiskitHumanEvalSimX.