Advances in large vision-language models (VLMs) have stimulated growing interest in vision-language-action (VLA) systems for robot manipulation. However, existing manipulation datasets remain costly to curate, highly embodiment-specific, and insufficient in coverage and diversity, thereby hindering the generalization of VLA models. Recent approaches attempt to mitigate these limitations via a plan-then-execute paradigm, in which high-level plans (e.g., subtasks, traces) are first generated and then translated into low-level actions; however, these methods critically rely on extra intermediate supervision, which is largely absent from existing datasets. To bridge this gap, we introduce the RoboInter Manipulation Suite, a unified resource comprising data, benchmarks, and models built on intermediate representations for manipulation. It includes RoboInter-Tool, a lightweight GUI that enables semi-automatic annotation of diverse representations, and RoboInter-Data, a large-scale dataset of over 230k episodes across 571 diverse scenes with dense per-frame annotations spanning more than 10 categories of intermediate representations, substantially exceeding prior work in scale and annotation quality. Building on this foundation, RoboInter-VQA introduces 9 spatial and 20 temporal embodied VQA categories to systematically benchmark and enhance the embodied reasoning capabilities of VLMs. Meanwhile, RoboInter-VLA offers an integrated plan-then-execute framework supporting both modular and end-to-end VLA variants that bridge high-level planning with low-level execution via intermediate supervision. Overall, RoboInter establishes a practical foundation for advancing robust and generalizable robotic learning through fine-grained and diverse intermediate representations.