Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using this interface, we assemble a large training corpus comprising an embodiment-matched synthetic set (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of both real data and the discriminator for dexterous performance.
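The clip-level weighting described above can be illustrated with a minimal sketch. The abstract does not specify how discriminator outputs are turned into weights or how they enter the loss, so everything here is an assumption made for illustration: quality scores are taken to lie in [0, 1], they are clipped at a hypothetical `floor` and normalized to mean 1, and the policy objective is taken to be a per-clip noise-prediction MSE. The function names `clip_weights` and `weighted_denoising_loss` are hypothetical and are not Dexora's API.

```python
from typing import Sequence


def clip_weights(scores: Sequence[float], floor: float = 0.1) -> list[float]:
    """Map per-clip quality scores to loss weights with mean 1.

    Assumption: scores come from an offline discriminator and lie in [0, 1].
    The floor keeps low-quality clips from being discarded entirely.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    raw = [max(floor, min(1.0, s)) for s in scores]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def weighted_denoising_loss(
    pred_noise: Sequence[Sequence[float]],
    true_noise: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Average the per-clip noise-prediction MSE, scaled by each clip's weight.

    Assumption: a standard noise-prediction objective stands in for the
    diffusion-transformer policy loss, which the abstract does not detail.
    """
    total = 0.0
    for pred, true, w in zip(pred_noise, true_noise, weights):
        mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
        total += w * mse
    return total / len(weights)


# Three clips: high, low, and medium quality according to the discriminator.
weights = clip_weights([0.9, 0.1, 0.5])
loss = weighted_denoising_loss(
    pred_noise=[[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    true_noise=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    weights=weights,
)
```

In this toy batch the first two clips have the same prediction error, but the low-scoring clip contributes far less to the loss than the high-scoring one, which is the down-weighting effect the recipe aims for. Normalizing the weights to mean 1 is one possible design choice that keeps the overall loss scale comparable to unweighted training.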