Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. To address this challenge, this work introduces a novel approach to programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES uses multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis into the policy generation process to identify failure patterns and guide targeted improvements, thereby improving policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) on two standard control tasks while providing transparent control logic and a traceable design process. The approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and scales across tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.