Designing dense reward functions is pivotal for efficient robotic reinforcement learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of RL. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL: Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes each task into multi-stage subtasks, using task direction projection to make the reward sensitive to the trajectory. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
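The abstract names two mechanisms, multi-stage decomposition and task direction projection, without giving their formulas. The sketch below is one plausible reading, not the authors' implementation: it assumes each stage is described by a start and a goal embedding (in practice these would come from the fine-tuned VLM; here they are plain NumPy vectors), scores progress by projecting the current state embedding onto the start-to-goal direction, and adds the index of the current stage so that later stages always score higher. The function names `direction_projection_progress` and `multi_stage_reward`, and the clipping to [0, 1], are illustrative assumptions.

```python
import numpy as np


def direction_projection_progress(state_emb, start_emb, goal_emb):
    """Fraction of the start-to-goal displacement covered by the state.

    Assumption: progress is the scalar projection of (state - start) onto
    the task direction (goal - start), normalized by the direction length.
    Motion orthogonal to the task direction contributes nothing.
    """
    direction = goal_emb - start_emb
    sq_length = float(np.dot(direction, direction))
    if sq_length == 0.0:
        raise ValueError("start and goal embeddings coincide")
    return float(np.dot(state_emb - start_emb, direction) / sq_length)


def multi_stage_reward(state_emb, stage_embs, stage):
    """Dense reward for a task split into len(stage_embs) - 1 stages.

    stage_embs[k] and stage_embs[k + 1] are the hypothetical start and goal
    embeddings of stage k. The reward is the stage index plus the clipped
    within-stage progress, so it increases monotonically across stages.
    """
    progress = direction_projection_progress(
        state_emb, stage_embs[stage], stage_embs[stage + 1]
    )
    return stage + float(np.clip(progress, 0.0, 1.0))


# Toy 2-D "embeddings" for a two-stage task (e.g. reach, then push).
stage_embs = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 2.0])]

halfway_stage0 = multi_stage_reward(np.array([0.5, 0.3]), stage_embs, stage=0)
halfway_stage1 = multi_stage_reward(np.array([1.0, 1.0]), stage_embs, stage=1)
off_direction = multi_stage_reward(np.array([0.0, 5.0]), stage_embs, stage=0)
```

In this toy setup a state that drifts away from the task direction (`off_direction`) earns no progress, which is the trajectory sensitivity the abstract attributes to the projection; how MARVL determines the current stage and obtains the embeddings is not stated in the abstract and is left out here.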