Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, in which high-level plans (e.g., subtasks, traces) are first generated and then translated into low-level actions; however, these methods critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource comprising data, benchmarks, and models built on intermediate representations for manipulation. It includes RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset of over 230k episodes across 571 diverse scenes with dense per-frame annotations spanning more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building on this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework supporting both modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. Overall, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning through fine-grained and diverse intermediate representations.