We present a multi-agent framework for generating physics simulation code from natural language descriptions, featuring a novel perceptual self-reflection mechanism for validation. The system employs four specialized agents: a natural language interpreter that converts user requests into physics-based descriptions; a technical requirements generator that produces scaled simulation parameters; a physics code generator with automated self-correction; and a physics validator that implements perceptual self-reflection. The key innovation is perceptual validation, which analyzes rendered animation frames using a vision-capable language model rather than inspecting code structure directly. This approach addresses the ``oracle gap'', in which syntactically correct code produces physically incorrect behavior, a failure mode that conventional testing cannot detect. We evaluate the system across seven domains: classical mechanics, fluid dynamics, thermodynamics, electromagnetics, wave physics, reaction-diffusion systems, and non-physics data visualization. The perceptual self-reflection architecture demonstrates substantial improvement over single-shot generation baselines, with the majority of tested scenarios achieving target physics accuracy thresholds. The system exhibits robust pipeline stability with consistent code self-correction capability, operating at approximately \$0.20 per animation. These results validate our hypothesis that feeding visual simulation outputs back to a vision-language model for iterative refinement significantly outperforms single-shot code generation for physics simulation tasks, and they highlight the potential of agentic AI to support engineering workflows and physics data generation pipelines.
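The control flow implied by the four-agent description can be sketched as a short refinement loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine_until_valid`, `Verdict`, and the five callables are hypothetical, the round limit of 3 is an arbitrary choice, and the stub agents below stand in for real language-model and rendering calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Outcome of one perceptual check on rendered frames (hypothetical type)."""
    passed: bool
    feedback: str


def refine_until_valid(
    request: str,
    interpret: Callable[[str], str],                  # agent 1: request -> physics description
    plan: Callable[[str], dict],                      # agent 2: description -> simulation parameters
    generate: Callable[[dict, Optional[str]], str],   # agent 3: parameters (+ feedback) -> code
    render: Callable[[str], List[bytes]],             # runs the code, returns animation frames
    validate: Callable[[List[bytes], str], Verdict],  # agent 4: judges frames, not code
    max_rounds: int = 3,                              # assumed limit, not taken from the paper
) -> Tuple[str, int, bool]:
    """Return (final code, rounds used, whether validation passed)."""
    description = interpret(request)
    requirements = plan(description)
    feedback: Optional[str] = None
    code = ""
    for round_index in range(1, max_rounds + 1):
        code = generate(requirements, feedback)
        frames = render(code)
        verdict = validate(frames, description)
        if verdict.passed:
            return code, round_index, True
        # Feed the visual critique back into the next generation attempt.
        feedback = verdict.feedback
    return code, max_rounds, False


# Stub agents for illustration only: the first attempt "looks wrong",
# and the attempt produced after feedback "looks right".
def stub_generate(requirements: dict, feedback: Optional[str]) -> str:
    return "v1" if feedback is None else "v2"


def stub_render(code: str) -> List[bytes]:
    return [code.encode()]


def stub_validate(frames: List[bytes], description: str) -> Verdict:
    ok = frames[0] == b"v2"
    return Verdict(ok, "" if ok else "ball accelerates upward")


result = refine_until_valid(
    "a ball falling under gravity",
    interpret=lambda r: "physics: " + r,
    plan=lambda d: {"description": d},
    generate=stub_generate,
    render=stub_render,
    validate=stub_validate,
)
print(result)  # ('v2', 2, True)
```

The point of the sketch is that `validate` receives only rendered frames and the physics description, never the code, so a program that runs cleanly but animates the wrong behavior can still be rejected and regenerated.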