Large Language Models (LLMs) have demonstrated impressive capabilities in code generation. While an interactive feedback loop driven by tests can improve performance, writing effective tests is a non-trivial task. Early multi-agent frameworks, such as AgentCoder, automated this process but relied on generated tests as absolute ground truth. This approach is fragile: incorrect code frequently passes faulty or trivial tests, while valid solutions are often degraded to satisfy incorrect assertions. To address this limitation, newer methods have largely abandoned test generation in favor of planning and reasoning based on examples. We argue, however, that generated tests remain a valuable signal if we model them as noisy sensors guided by Bayesian updates. To this end, we introduce BACE (Bayesian Anchored Co-Evolution), a framework that reformulates synthesis as a Bayesian co-evolutionary process in which code and test populations evolve together, guided by belief distributions that are reciprocally updated based on noisy interaction evidence. By anchoring this search on minimal public examples, BACE prevents the co-evolutionary drift typical of self-validating loops. Extensive evaluations on LiveCodeBench v6 (post-March 2025) reveal that BACE achieves superior performance across both proprietary models and open-weight small language models.
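To make the "noisy sensor" idea concrete, the following is a minimal sketch of a reciprocal Bayesian update for a single code candidate and a single generated test. It is not the BACE algorithm itself: the four-hypothesis model, the `NoiseModel` class, the `update_beliefs` function, and all probability values are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseModel:
    """Hypothetical pass probabilities (illustrative values, not from the paper)."""

    both_correct: float = 0.95  # correct code passes a correct test
    code_wrong: float = 0.05    # incorrect code passes a correct test
    test_wrong: float = 0.50    # a faulty test is treated as uninformative

    def p_pass(self, code_ok: bool, test_ok: bool) -> float:
        """Probability of observing a pass under one correctness hypothesis."""
        if not test_ok:
            return self.test_wrong
        return self.both_correct if code_ok else self.code_wrong


def update_beliefs(p_code: float, p_test: float, passed: bool,
                   noise: NoiseModel = NoiseModel()) -> tuple[float, float]:
    """Return posterior (P(code correct), P(test correct)) after one run.

    Both beliefs are updated from the same observation by enumerating the
    four joint hypotheses and marginalising the joint posterior.
    """
    joint = {}
    for code_ok in (True, False):
        for test_ok in (True, False):
            prior = ((p_code if code_ok else 1.0 - p_code)
                     * (p_test if test_ok else 1.0 - p_test))
            likelihood = noise.p_pass(code_ok, test_ok)
            if not passed:
                likelihood = 1.0 - likelihood
            joint[(code_ok, test_ok)] = prior * likelihood

    evidence = sum(joint.values())
    post_code = (joint[(True, True)] + joint[(True, False)]) / evidence
    post_test = (joint[(True, True)] + joint[(False, True)]) / evidence
    return post_code, post_test


if __name__ == "__main__":
    # A pass on an uncertain generated test versus a pass on a trusted anchor.
    print(update_beliefs(p_code=0.5, p_test=0.5, passed=True))
    print(update_beliefs(p_code=0.5, p_test=1.0, passed=True))
```

Under these assumed numbers, a pass on a generated test that is only 50% trusted raises the code belief from 0.5 to about 0.73, while a pass on a fully trusted test (standing in for a public example used as an anchor) raises it to about 0.95. This shows how a test's influence can scale with the belief in the test rather than being treated as absolute ground truth.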