Large Language Models (LLMs) show promise for automating scientific code generation but face challenges in reliability, error propagation across multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework designed specifically for AI for Science (AI4S) tasks and implemented as a Low-code Platform (LCP). The framework coordinates three LLM-based agents under a Bayesian scheme: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator that provides comprehensive feedback. An adversarial loop drives the process: the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are updated according to Bayesian principles that integrate three code-quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses the evaluation uncertainty inherent in scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, removing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The platform is further tested on a cross-disciplinary Earth Science task, where it demonstrates strong reliability and outperforms competing models.
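To make the adversarial co-optimization loop concrete, the following is a minimal, hedged Python sketch of one plausible reading of the abstract: the Task Manager tightens test cases each round, the Code Generator samples a prompt template from a Dirichlet-style distribution, and the Evaluator's combined score (functional correctness, structural alignment, static analysis) feeds a pseudo-count posterior update. All names, templates, weights, and stubbed LLM calls here are illustrative assumptions, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical prompt templates for the Code Generator; in the described framework
# these would be derived from the Task Manager's structured plan.
PROMPT_TEMPLATES = [
    "Write concise, vectorized code for: {task}",
    "Write defensively checked code with explicit error handling for: {task}",
    "Write well-documented, modular code for: {task}",
]

@dataclass
class Scores:
    functional: float   # fraction of adaptive test cases passed
    structural: float   # alignment with the Task Manager's plan, in [0, 1]
    static: float       # normalized static-analysis score, in [0, 1]

    def combined(self) -> float:
        # Illustrative weighted sum; the paper's actual weighting is not specified here.
        return 0.5 * self.functional + 0.3 * self.structural + 0.2 * self.static

def task_manager(task: str, round_idx: int) -> list[str]:
    """Stub: returns progressively harder test cases (adversarial refinement)."""
    return [f"test_{round_idx}_{i}" for i in range(3 + round_idx)]

def code_generator(task: str, prompt: str) -> str:
    """Stub: stands in for an LLM call returning candidate code."""
    return f"# candidate code for '{task}' via prompt: {prompt[:30]}..."

def evaluator(code: str, tests: list[str]) -> Scores:
    """Stub: stands in for test execution plus static analysis of the candidate."""
    return Scores(functional=random.random(),
                  structural=random.random(),
                  static=random.random())

def run_loop(task: str, rounds: int = 5, seed: int = 0) -> str:
    random.seed(seed)
    # Dirichlet-style pseudo-counts over prompt templates (uniform prior).
    alpha = [1.0] * len(PROMPT_TEMPLATES)
    best_code, best_score = "", -1.0
    for r in range(rounds):
        tests = task_manager(task, r)                       # adversarial test refinement
        # Sample a template with probability proportional to its pseudo-count.
        idx = random.choices(range(len(PROMPT_TEMPLATES)), weights=alpha)[0]
        code = code_generator(task, PROMPT_TEMPLATES[idx])
        scores = evaluator(code, tests)
        reward = scores.combined()
        alpha[idx] += reward                                # posterior-style update of prompt distribution
        if reward > best_score:
            best_code, best_score = code, reward
    return best_code

if __name__ == "__main__":
    print(run_loop("compute NDVI from a satellite raster"))
```

The design choice mirrored here is that tests and code improve together: harder tests raise the bar for the generator, while the prompt distribution concentrates on templates that historically earned higher combined scores.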