Large-scale generative models have been shown to be useful for sampling meaningful candidate solutions, yet they often overlook task constraints and user preferences. Their full power is better harnessed when the models are coupled with external verifiers and the final solutions are derived iteratively or progressively according to the verification feedback. In the context of embodied AI, verification often solely involves assessing whether goal conditions specified in the instructions have been met. Nonetheless, for these agents to be seamlessly integrated into daily life, it is crucial to account for a broader range of constraints and preferences beyond bare task success (e.g., a robot should grasp bread with care to avoid significant deformations). However, given the unbounded scope of robot tasks, it is infeasible to construct scripted verifiers akin to those used for explicit-knowledge tasks such as the game of Go and theorem proving. This raises the question: when no sound verifier is available, can we use large vision and language models (VLMs), which are approximately omniscient, as scalable Behavior Critics to catch undesirable robot behaviors in videos? To answer this, we first construct a benchmark that contains diverse cases of goal-reaching yet undesirable robot policies. Then, we comprehensively evaluate VLM critics to gain a deeper understanding of their strengths and failure modes. Based on the evaluation, we provide guidelines on how to effectively utilize VLM critiques and showcase a practical way to integrate the feedback into an iterative process of policy refinement. The dataset and codebase are released at: https://guansuns.github.io/pages/vlm-critic.