Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. We address this challenge by introducing Multimodal Large Language Model-assisted Evolutionary Search (MLES), a novel approach to programmatic control-policy discovery. MLES uses multimodal large language models as programmatic policy generators and combines them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis into the policy generation process to identify failure patterns and guide targeted improvements, thereby improving the efficiency of policy discovery and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) on two standard control tasks while providing transparent control logic and a traceable design process. The approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and scales across tasks, showing promise as a new paradigm for developing transparent and verifiable control policies. Code is publicly available at https://github.com/QingL2000/MLES.
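The generate-evaluate-select loop described in the abstract can be sketched in miniature. This is a hedged illustration only: the paper's policies are programs produced by a multimodal LLM from visual feedback, which is stubbed out here by a hypothetical `propose_variant` function, and the environment is replaced by a toy one-dimensional regulation task (drive the state toward zero). None of these names come from the MLES codebase.

```python
import random

def evaluate(policy, steps=50, seed=0):
    """Score a policy on a toy control task: keep the state near the origin."""
    rng = random.Random(seed)
    x = rng.uniform(-1.0, 1.0)          # random initial state
    total = 0.0
    for _ in range(steps):
        x += policy(x)                  # apply the policy's action
        total -= abs(x)                 # reward: penalize distance from 0
    return total

def propose_variant(gain, rng):
    """Hypothetical stand-in for the MLLM step: in MLES a multimodal LLM
    rewrites the policy's source code guided by visual behavior analysis;
    here we just perturb a single feedback gain."""
    return gain + rng.gauss(0.0, 0.1)

def evolutionary_search(generations=20, pop_size=8, seed=1):
    """Minimal evolutionary loop: evaluate, keep the elite, propose variants."""
    rng = random.Random(seed)
    population = [rng.uniform(-1.0, 0.0) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population,
                        key=lambda g: evaluate(lambda x: g * x),
                        reverse=True)
        elite = scored[: pop_size // 2]
        population = elite + [propose_variant(g, rng) for g in elite]
    return max(population, key=lambda g: evaluate(lambda x: g * x))

best = evolutionary_search()
```

The discovered policy (here a single proportional gain near -1, so the update contracts the state) should outperform the do-nothing baseline; in MLES the same loop operates over full policy programs rather than one scalar parameter.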