LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

From arXiv. Work in progress. Project website: https://zjunlp.github.io/LabVLA/

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely beyond their reach. AI can help read literature, generate hypotheses, and plan protocols, yet executing those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models offer one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action-token pretraining first makes the Qwen3-VL-4B-Instruct backbone action-aware before any continuous control is learned, and flow-matching post-training then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines in both in-distribution and out-of-distribution settings.

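The flow-matching stage named in the abstract can be illustrated with a minimal sketch. The abstract does not state which probability path or parameterization LabVLA uses, so the linear interpolation path below (the common rectified-flow-style formulation) is an assumption, and the function names `flow_matching_pair`, `mse`, and `euler_sample` are hypothetical. The DiT action expert and knowledge insulation are not modeled here; an oracle velocity function stands in for the learned network.

```python
# Minimal flow-matching sketch in plain Python (no external libraries).
# ASSUMPTION: linear path x_t = (1 - t) * noise + t * action, with the
# regression target velocity = action - noise. The abstract does not
# confirm that LabVLA uses this exact formulation.

def flow_matching_pair(action, noise, t):
    """Return (x_t, target_velocity) for one training example."""
    x_t = [(1.0 - t) * n + t * a for a, n in zip(action, noise)]
    velocity = [a - n for a, n in zip(action, noise)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-dimensional action and a fixed noise sample.
action = [0.5, -0.2, 1.0]
noise = [0.0, 1.0, -1.0]

# Training side: a network would be regressed onto `target` given (x_t, t).
x_t, target = flow_matching_pair(action, noise, t=0.3)

# Inference side: an oracle that returns the true velocity stands in for
# the learned action expert, so integration should recover `action`.
recovered = euler_sample(lambda x, t: [a - n for a, n in zip(action, noise)], noise)
```

In training, the loss would be `mse(predicted_velocity, target)` with the prediction coming from the action expert; at inference, the same integration loop would start from sampled noise and use the learned velocity in place of the oracle.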