RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman, Izhak Shafran, Brian Ichter, Yuan Cao

We present a scalable, bottom-up, and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput than traditional narrow, top-down, step-by-step collection. We collect realistic data by performing arbitrary user requests throughout 3 office buildings, using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on robot data only, even when evaluated solely on robot episodes. We find that, for a fixed collection budget, it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA, containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making the system deployable with human oversight even if imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa, trained on our dataset, that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a lot of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks. Data and videos are available at https://robovqa.github.io
