FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

From arXiv: 26 pages, 7 figures, 25 tables

Vision-Language-Action (VLA) models are increasingly expected not only to complete robot tasks but also to follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as the active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 11,631 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: training on fine-grained instructions only (FG-only) improves over training on raw instructions only (Raw-only) by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary: success follows a consistent inverted-U trend as the mixture varies, peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18), factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions, specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/
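
The abstract does not say how the FG:Raw mixtures are realized during training. As a rough illustration only, one plausible reading is per-sample instruction selection, where each training sample uses the fine-grained instruction with probability FG / (FG + Raw). The function name `sample_instruction`, its parameters, and the example instruction strings below are hypothetical and are not taken from the FineVLA codebase.

```python
import random


def sample_instruction(fine_grained, raw, fg_weight, raw_weight, rng):
    """Return one instruction string for a training sample.

    Assumption (not from the paper): a ratio FG:Raw = fg_weight:raw_weight
    is realized by choosing the fine-grained instruction with probability
    fg_weight / (fg_weight + raw_weight), independently per sample.
    """
    total = fg_weight + raw_weight
    if fg_weight < 0 or raw_weight < 0 or total <= 0:
        raise ValueError("weights must be non-negative and not both zero")
    # rng.random() lies in [0.0, 1.0), so a probability of 1.0 always picks
    # fine-grained and a probability of 0.0 always picks raw.
    return fine_grained if rng.random() < fg_weight / total else raw


# Made-up instruction pair for a single trajectory.
RAW = "pick up the cup"
FG = "use the left arm to grasp the cup by its handle, approaching from the right"

rng = random.Random(0)
# FG:Raw = 1:2, the lower end of the peak range reported in the abstract.
batch = [sample_instruction(FG, RAW, 1, 2, rng) for _ in range(6)]
```

Under this reading, FG-only and Raw-only correspond to weights (1, 0) and (0, 1), and the reported peak range of 1:2 to 1:1 corresponds to a fine-grained probability between 1/3 and 1/2.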
