Designing suitable rewards is a central challenge in reinforcement learning (RL), especially for embodied manipulation. Trajectory-level success rewards are easy for human judges to assign or for models to fit, but their sparsity severely limits RL sample efficiency. While recent methods improve RL with dense rewards, they rely heavily on high-quality human-annotated data or abundant expert supervision. To address these issues, this paper proposes Dual-granularity contrastive reward via generated Episodic Guidance (DEG), a novel framework that provides sample-efficient dense rewards without human annotations or extensive supervision. Leveraging the prior knowledge of large video generation models, DEG needs only a small number of expert videos for domain adaptation and then generates a dedicated guidance video for each RL episode. The proposed dual-granularity reward, which balances coarse-grained exploration with fine-grained matching, guides the agent to sequentially approximate the generated guidance video in a contrastive self-supervised latent space and ultimately complete the target task. Extensive experiments on 18 diverse tasks across both simulation and real-world settings show that DEG not only serves as an efficient exploration stimulus that helps the agent quickly discover sparse success rewards, but also drives effective RL and stable policy convergence on its own.
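The abstract describes the dual-granularity reward only at a high level. Below is a minimal sketch of one plausible reading: a coarse-grained term rewarding closeness to any frame of the generated guidance video (encouraging exploration toward it), combined with a fine-grained term rewarding alignment with the temporally matched frame, both computed as cosine similarities in a contrastive embedding space. The function name, the weighting parameter `alpha`, and the cosine-similarity choice are all assumptions for illustration, not the paper's actual definition.

```python
import torch
import torch.nn.functional as F

def dual_granularity_reward(agent_emb: torch.Tensor,
                            guide_embs: torch.Tensor,
                            t: int,
                            alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of a dual-granularity contrastive reward.

    agent_emb:  (D,) contrastive embedding of the agent's current observation
    guide_embs: (T, D) embeddings of the T generated guidance-video frames
    t:          index of the temporally aligned guidance frame for this step
    alpha:      assumed weight balancing the two granularities
    """
    # Cosine similarity between the agent frame and every guidance frame: (T,)
    sims = F.cosine_similarity(agent_emb.unsqueeze(0), guide_embs, dim=-1)
    # Coarse-grained: best match to any guidance frame (exploration signal).
    r_coarse = sims.max()
    # Fine-grained: match to the temporally aligned frame (sequential matching).
    r_fine = sims[t]
    return alpha * r_coarse + (1.0 - alpha) * r_fine
```

In this sketch the coarse term keeps the reward dense even when the agent is far off the guidance trajectory, while the fine term pushes it to follow the guidance video in order; how DEG actually combines the two is defined in the paper itself.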