Dexora: Open-source VLA for High-DoF Bimanual Dexterity

Zongzheng Zhang, Jingrui Pang, Zhuo Yang, Kun Li, Minwen Liao, Saining Zhang, Guoxuan Chi, Jinbang Guo, Huan-ang Gao, Modi Shi, Dongyun Ge, Yao Mu, Jiayuan Gu, Rui Chen, Hao Dong, Huazhe Xu, Li Yi, Yixin Zhu, Hang Zhao, Pengwei Wang, Shanghang Zhang, Guocai Yao, Jianyu Chen, Hongyang Li, Hao Zhao

From arXiv; accepted by ICRA 2026

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.

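The data-quality-aware training recipe described in the abstract (an offline discriminator supplying clip-level weights that down-weight low-quality demonstrations) can be illustrated with a minimal sketch. The abstract does not specify how discriminator scores are mapped to weights or how the loss is normalized, so everything below is an assumption made for illustration: the function names `clip_weights` and `weighted_denoising_loss`, the `floor` parameter, the mean-one normalization, and the use of a plain mean-squared denoising error are hypothetical and are not taken from the Dexora implementation.

```python
from typing import List, Sequence


def clip_weights(quality_scores: Sequence[float], floor: float = 0.1) -> List[float]:
    """Map per-clip discriminator scores to loss weights.

    Assumption: scores lie in [0, 1], higher meaning better quality.
    Scores are clamped to [floor, 1] so no clip is discarded entirely,
    then rescaled so the weights average to 1 over the batch.
    """
    clamped = [max(floor, min(1.0, s)) for s in quality_scores]
    mean = sum(clamped) / len(clamped)
    return [c / mean for c in clamped]


def weighted_denoising_loss(
    pred_noise: Sequence[Sequence[float]],
    true_noise: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weighted mean of per-clip MSE between predicted and true noise.

    Each inner sequence holds the flattened action-noise values for one clip.
    """
    per_clip = [
        sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
        for pred, true in zip(pred_noise, true_noise)
    ]
    return sum(w * l for w, l in zip(weights, per_clip)) / len(per_clip)


# Hypothetical batch of two clips: a clean one (score 0.9) and a noisy one (score 0.3).
scores = [0.9, 0.3]
pred = [[0.0, 0.0], [1.0, 1.0]]
true = [[0.0, 0.0], [0.0, 0.0]]

weights = clip_weights(scores)
loss = weighted_denoising_loss(pred, true, weights)
uniform_loss = weighted_denoising_loss(pred, true, [1.0, 1.0])
```

In this toy batch the noisy clip carries all of the error, so its lower weight shrinks the loss relative to uniform weighting. In a real diffusion-transformer policy the per-clip error would come from the model's noise prediction at a sampled diffusion step, which this sketch leaves out.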