A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

Kaidong Zhang, Jian Zhang, Rongtao Xu, Yu Sun, Shuoshuo Xue, Youpeng Wen, Xiaoyu Guo, Minghao Guo, Weijia Liufu, Liu Zihou, Kangyi Ji, Yangsong Zhang, Jiarun Zhu, Jingzhi Liu, Zihang Li, Ruiyi Chen, Meng Cao, Jingming Zhang, Shen Zhao, Xiaojun Chang, Feng Zheng, Ivan Laptev, Xiaodan Liang

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for open-world robot manipulation, but their practical deployment is often constrained by cost: billion-scale VLM backbones and iterative diffusion- or flow-based action heads incur high latency and compute, making real-time control expensive on commodity hardware. We present A1, a fully open-source and transparent VLA framework designed for low-cost, high-throughput inference without sacrificing manipulation success. Our approach leverages pretrained VLMs, which provide implicit affordance priors for action generation. We release the full training stack (training code, data and data-processing pipeline, intermediate checkpoints, and evaluation scripts) to enable end-to-end reproducibility. Rather than optimizing the VLM alone, A1 targets the full inference pipeline with a budget-aware adaptive inference scheme that jointly accelerates the backbone and the action head. Specifically, we monitor action consistency across intermediate VLM layers to trigger early termination, and we propose Inter-Layer Truncated Flow Matching, which warm-starts denoising across layers and yields accurate actions with substantially fewer effective denoising iterations. Across simulation benchmarks (LIBERO, VLABench) and real robots (Franka, AgiBot), A1 achieves state-of-the-art success rates while significantly reducing inference cost (e.g., up to 72% lower per-episode latency for flow-matching inference and up to 76.6% less backbone computation, with minor performance degradation). On RoboChallenge, A1 achieves an average success rate of 29.00%, outperforming baselines including pi0 (28.33%), X-VLA (21.33%), and RDT-1B (15.00%).

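The two mechanisms named in the abstract, early termination based on action consistency across layers and warm-started (truncated) denoising, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the consistency metric, the exit rule, or the warm-start schedule, so the L2 threshold `tol`, the restart time `warm_t`, the step counts, and the function names `euler_flow`, `adaptive_inference`, and `toy_velocity` are all assumptions made for illustration. The per-layer "features" here are simply target actions that stabilize with depth, standing in for intermediate VLM hidden states.

```python
import math
import random


def euler_flow(action, features, velocity, t_start, num_steps):
    """Euler-integrate da/dt = velocity(a, t, features) from t_start to t = 1."""
    dt = (1.0 - t_start) / num_steps
    for k in range(num_steps):
        t = t_start + k * dt
        v = velocity(action, t, features)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


def adaptive_inference(layer_features, velocity, action_dim, full_steps=10,
                       warm_t=0.7, warm_steps=3, tol=0.05, seed=0):
    """Hypothetical early-exit loop with warm-started denoising.

    The first layer denoises from pure noise with `full_steps` steps. Each
    later layer restarts at time `warm_t` from a blend of the previous
    layer's action and the noise, so it needs only `warm_steps` steps. The
    loop exits once two consecutive layers agree within `tol` (L2 distance).
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    prev, steps_used = None, 0
    for depth, feats in enumerate(layer_features, start=1):
        if prev is None:
            action = euler_flow(noise, feats, velocity, 0.0, full_steps)
            steps_used += full_steps
        else:
            # Assumed warm start: partially re-noise the previous action.
            start = [warm_t * a + (1.0 - warm_t) * n
                     for a, n in zip(prev, noise)]
            action = euler_flow(start, feats, velocity, warm_t, warm_steps)
            steps_used += warm_steps
            if math.dist(action, prev) < tol:
                return action, depth, steps_used  # consistent: exit early
        prev = action
    return prev, len(layer_features), steps_used


def toy_velocity(action, t, target):
    """Toy velocity field that transports the sample straight to `target`."""
    return [(g - a) / max(1.0 - t, 1e-6) for a, g in zip(action, target)]


# Toy per-layer targets that stabilize with depth (stand-ins for VLM layers).
layers = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0],
          [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
action, exit_depth, steps = adaptive_inference(layers, toy_velocity, action_dim=2)
print(exit_depth, steps, [round(a, 3) for a in action])
```

In this toy setting the loop stops as soon as two consecutive layers produce nearly the same action, so the remaining layers are never evaluated, and every layer after the first spends only `warm_steps` denoising steps instead of `full_steps`. How the real system trades the consistency threshold against a compute budget is not described in the abstract and is left out of the sketch.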