Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have enabled powerful semantic and multimodal reasoning capabilities, creating new opportunities to enhance sample efficiency, high-level planning, and interpretability in reinforcement learning (RL). While prior work has integrated LLMs and VLMs into various components of RL, the replay buffer, a core component for storing and reusing experiences, remains unexplored. We address this gap by leveraging VLMs to guide the prioritization of experiences in the replay buffer. Our key idea is to use a frozen, pre-trained VLM (requiring no fine-tuning) as an automated evaluator that identifies and prioritizes promising sub-trajectories from the agent's experiences. Across game-playing and robotics scenarios spanning both discrete and continuous domains, agents trained with our prioritization method achieve 11-52% higher average success rates and improve sample efficiency by 19-45% compared to previous approaches. Project page: https://esharony.me/projects/vlm-rb/
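The core mechanism described above — a frozen VLM scoring sub-trajectories and a replay buffer that samples in proportion to those scores — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the `score_fn` interface (standing in for a VLM evaluator call), and the eviction policy are all hypothetical.

```python
import random


class VLMPrioritizedReplayBuffer:
    """Minimal sketch of VLM-guided experience prioritization.

    Each stored item is a (priority, sub_trajectory) pair, where the
    priority comes from an external evaluator (here a plain callable
    standing in for a frozen, pre-trained VLM). Sampling is weighted
    by priority, as in prioritized experience replay.
    """

    def __init__(self, capacity, score_fn):
        self.capacity = capacity
        self.score_fn = score_fn  # hypothetical stand-in for the VLM evaluator
        self.buffer = []          # list of (priority, sub_trajectory)

    def add(self, sub_trajectory):
        # Score the sub-trajectory once at insertion time.
        priority = float(self.score_fn(sub_trajectory))
        self.buffer.append((priority, sub_trajectory))
        if len(self.buffer) > self.capacity:
            # Evict the lowest-priority entry to stay within capacity.
            self.buffer.remove(min(self.buffer, key=lambda item: item[0]))

    def sample(self, k):
        # Sample k sub-trajectories with probability proportional to priority.
        weights = [priority for priority, _ in self.buffer]
        chosen = random.choices(self.buffer, weights=weights, k=k)
        return [traj for _, traj in chosen]


# Usage with a toy scorer (longer sub-trajectories get higher priority):
buf = VLMPrioritizedReplayBuffer(capacity=2, score_fn=len)
for traj in ([1], [1, 2], [1, 2, 3]):
    buf.add(traj)
batch = buf.sample(4)
```

In a real system, `score_fn` would render the sub-trajectory's frames, query the frozen VLM with a task-relevant prompt, and parse a scalar score from its response; no VLM weights are updated.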