Intelligent exploration remains a critical challenge in reinforcement learning (RL), especially in visual control tasks. Unlike low-dimensional state-based RL, visual RL must extract task-relevant structure from raw pixels, making exploration inefficient. We propose Concept-Driven Exploration (CDE), which leverages a pre-trained vision-language model (VLM) to generate object-centric visual concepts from textual task descriptions as weak, potentially noisy supervisory signals. Rather than directly conditioning on these noisy signals, CDE trains a policy to reconstruct the concepts via an auxiliary objective, using reconstruction accuracy as an intrinsic reward to guide exploration toward task-relevant objects. Because the policy internalizes these concepts, VLM queries are only needed during training, reducing dependence on external models during deployment. Across five challenging simulated visual manipulation tasks, CDE achieves efficient, targeted exploration and remains robust to noisy VLM predictions. Finally, we demonstrate real-world transfer by deploying CDE on a Franka Research 3 arm, attaining an 80\% success rate in a real-world manipulation task.
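The intrinsic-reward idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: we assume K binary object-centric concept labels queried from the VLM, and an auxiliary policy head that predicts concept probabilities; all names (`concept_reconstruction_reward`, `shaped_reward`, `beta`) are hypothetical.

```python
import numpy as np

def concept_reconstruction_reward(pred_concepts, vlm_concepts):
    """Intrinsic reward = accuracy of the policy's auxiliary head at
    reconstructing the (possibly noisy) VLM concept labels.

    pred_concepts: (K,) predicted probabilities for K binary concepts
    vlm_concepts:  (K,) weak 0/1 labels from the VLM (training only)
    """
    # Threshold predictions, compare against weak labels, average.
    correct = ((pred_concepts > 0.5).astype(float) == vlm_concepts)
    return correct.mean()  # in [0, 1]

def shaped_reward(env_reward, pred_concepts, vlm_concepts, beta=0.1):
    """Training-time reward: task reward plus a scaled intrinsic bonus
    that grows as the policy internalizes the task-relevant concepts."""
    bonus = concept_reconstruction_reward(pred_concepts, vlm_concepts)
    return env_reward + beta * bonus
```

At deployment no VLM query is needed: the shaping term is dropped and the policy acts from pixels alone, consistent with the training-only dependence on the VLM described above.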