CORE: Code-based Inverse Self-Training Framework with Graph Expansion for Virtual Agents

Multimodal Virtual Agents have advanced significantly through the integration of Multimodal Large Language Models. However, mainstream training paradigms face key challenges: Behavior Cloning is simple and effective through imitation but suffers from low behavioral diversity, while Reinforcement Learning can discover novel strategies through exploration but relies heavily on manually designed reward functions. To address the conflict between these two methods, we present CORE, a Code-based Inverse Self-Training Framework with Graph Expansion that bridges imitation and exploration, promoting behavioral diversity while eliminating the reliance on manual reward design. Specifically, we introduce Semantic Code Abstraction, which automatically infers reward functions from expert demonstrations. The inferred reward function, referred to as the Label Function, is executable code that verifies one key step within a task. Building on this, we propose Strategy Graph Expansion to enhance in-domain behavioral diversity: it constructs a multi-path graph, called the Strategy Graph, that captures diverse valid solutions beyond expert demonstrations. Furthermore, we introduce Trajectory-Guided Extrapolation, which enriches out-of-domain behavioral diversity by using both successful and failed trajectories to expand the task space. Experiments on Web and Android platforms demonstrate that CORE significantly improves both overall performance and generalization, highlighting its potential as a robust and generalizable training paradigm for building powerful virtual agents.
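To make the idea of a Label Function and a Strategy Graph more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that an environment state can be summarized as a flat dictionary, that each Label Function is a boolean predicate over one state, and that the Strategy Graph is a small directed graph whose nodes are key steps. The state keys (`page`, `query`, `category`, `cart`), the three label functions, the node names, and the greedy `verify` walk are all hypothetical choices made for this example.

```python
from typing import Any, Callable, Dict, List

# Assumption: a state is a flat dict of observed UI facts.
State = Dict[str, Any]
LabelFunction = Callable[[State], bool]


# Hypothetical Label Functions: each one checks a single key step.
def searched_for_item(state: State) -> bool:
    return state.get("page") == "search_results" and state.get("query") == "laptop"


def opened_category(state: State) -> bool:
    return state.get("page") == "category" and state.get("category") == "laptops"


def item_in_cart(state: State) -> bool:
    return "laptop" in state.get("cart", [])


LABELS: Dict[str, LabelFunction] = {
    "search": searched_for_item,
    "category": opened_category,
    "cart": item_in_cart,
}

# Hypothetical Strategy Graph: two alternative paths lead to the same goal.
# START -> search -> cart   (e.g. the expert demonstration)
# START -> category -> cart (an alternative valid solution)
GRAPH: Dict[str, List[str]] = {
    "START": ["search", "category"],
    "search": ["cart"],
    "category": ["cart"],
    "cart": [],
}
GOAL = "cart"


def verify(trajectory: List[State]) -> List[str]:
    """Greedily walk the graph and return the key steps the trajectory satisfied."""
    node = "START"
    matched: List[str] = []
    for state in trajectory:
        # Advance to the first successor whose label function accepts this state.
        for successor in GRAPH[node]:
            if LABELS[successor](state):
                node = successor
                matched.append(successor)
                break
        if node == GOAL:
            break
    return matched


def is_success(trajectory: List[State]) -> bool:
    """A trajectory succeeds if its matched key steps end at the goal node."""
    matched = verify(trajectory)
    return bool(matched) and matched[-1] == GOAL


expert = [
    {"page": "home"},
    {"page": "search_results", "query": "laptop"},
    {"page": "product", "cart": ["laptop"]},
]
alternative = [
    {"page": "category", "category": "laptops"},
    {"page": "product", "cart": ["laptop"]},
]
incomplete = [
    {"page": "search_results", "query": "laptop"},
    {"page": "home"},
]
```

In this sketch, both `expert` and `alternative` are accepted because each follows some path through the graph, while `incomplete` matches only the first key step. That mirrors, in a very simplified form, how a multi-path graph can reward solutions that differ from the expert demonstration without a hand-written reward function for each one.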
