I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures, which is essential for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge is detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets for semantic misalignment failure detection from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach post-trains a base VLM and then trains lightweight classification heads, called FS blocks, which are attached to different internal layers of the VLM and whose predictions are aggregated by an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both of comparable size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and transfers effectively to other simulation environments and to the real world, either zero-shot or with minimal post-training. The datasets and models are publicly released on HuggingFace (Webpage: https://clemgris.github.io/I-FailSense/).
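
To make the idea of per-layer classification heads with ensembling concrete, the following is a minimal sketch and not the I-FailSense implementation. It assumes that each FS block can be approximated by a logistic classifier over a pooled feature vector from one VLM layer, and that the ensemble averages the per-head failure probabilities; the abstract does not specify the head architecture or the aggregation rule. All names (`LinearHead`, `ensemble_failure_probability`, `detect_failure`) are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class LinearHead:
    """Hypothetical stand-in for one FS block: a logistic classifier
    over the pooled features of a single internal layer."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights = list(weights)
        self.bias = bias

    def predict_proba(self, features: Sequence[float]) -> float:
        """Return the head's estimated probability of failure."""
        if len(features) != len(self.weights):
            raise ValueError("feature size does not match head weights")
        logit = sum(w * f for w, f in zip(self.weights, features)) + self.bias
        return sigmoid(logit)


def ensemble_failure_probability(
    heads: Sequence[LinearHead],
    layer_features: Sequence[Sequence[float]],
) -> float:
    """Average the failure probabilities of all heads.

    Assumption: mean-probability ensembling. A majority vote or a
    learned weighting would be equally plausible alternatives.
    """
    if len(heads) != len(layer_features) or not heads:
        raise ValueError("need one feature vector per head")
    probs: List[float] = [
        head.predict_proba(feats) for head, feats in zip(heads, layer_features)
    ]
    return sum(probs) / len(probs)


def detect_failure(
    heads: Sequence[LinearHead],
    layer_features: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> bool:
    """Flag a failure when the ensembled probability reaches the threshold."""
    return ensemble_failure_probability(heads, layer_features) >= threshold
```

In a real system the feature vectors would come from the hidden states of the post-trained VLM at the chosen layers, and the heads would be trained on labeled success and failure examples; here they are plain lists of floats so that the aggregation step can be read in isolation.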
