Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset of over 50K trajectories across eight dexterous hands (6 to 24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure that aligns fingertip trajectories while preserving plausible hand-object contacts, and we operate on explicit 3D point clouds with human hands masked to narrow the kinematic and visual gaps between human and robot data. Second, we introduce the Function-Actuator-Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Using FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories. This enables human-robot data co-training, which reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.
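To make the idea of a shared action space concrete, the following is a minimal sketch of how actuators from different hands could be mapped to common coordinates. It is not the UniDex implementation: the slot names in `FAAS_SLOTS`, the two toy hands in `HAND_MAPS`, and the helpers `to_shared` and `from_shared` are all hypothetical, and the sketch assumes actuator commands are already normalized to a common range.

```python
# Hypothetical shared coordinates; the actual FAAS layout is not given in the abstract.
FAAS_SLOTS = [
    "thumb_abduction", "thumb_flexion",
    "index_abduction", "index_flexion",
    "middle_flexion", "ring_flexion",
    "pinky_abduction", "pinky_flexion",
]

# Hypothetical hands: native actuator index -> functional slot name.
HAND_MAPS = {
    "toy_hand_a": {  # 6 actuators, no abduction on index or pinky
        0: "thumb_abduction", 1: "thumb_flexion", 2: "index_flexion",
        3: "middle_flexion", 4: "ring_flexion", 5: "pinky_flexion",
    },
    "toy_hand_b": {  # 8 actuators, different native ordering
        0: "index_abduction", 1: "index_flexion", 2: "middle_flexion",
        3: "ring_flexion", 4: "pinky_abduction", 5: "pinky_flexion",
        6: "thumb_abduction", 7: "thumb_flexion",
    },
}


def to_shared(hand, native):
    """Scatter a native command into the shared vector.

    Returns (values, mask); mask[k] is True only for slots this hand drives.
    Slots the hand lacks keep the default value 0.0.
    """
    mapping = HAND_MAPS[hand]
    if len(native) != len(mapping):
        raise ValueError(f"{hand} expects {len(mapping)} values, got {len(native)}")
    values = [0.0] * len(FAAS_SLOTS)
    mask = [False] * len(FAAS_SLOTS)
    for actuator_idx, slot in mapping.items():
        k = FAAS_SLOTS.index(slot)
        values[k] = native[actuator_idx]
        mask[k] = True
    return values, mask


def from_shared(hand, values):
    """Gather a native command for `hand` from a shared vector."""
    mapping = HAND_MAPS[hand]
    return [values[FAAS_SLOTS.index(mapping[i])] for i in range(len(mapping))]


# A command recorded on toy_hand_a, re-expressed for toy_hand_b.
shared, mask = to_shared("toy_hand_a", [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
command_b = from_shared("toy_hand_b", shared)
print(command_b)  # [0.0, 0.3, 0.4, 0.5, 0.0, 0.6, 0.1, 0.2]
```

Because both hands read and write the same slots, a policy that outputs the shared vector can in principle drive either hand. The mask records which slots a given hand actually controls, which a training loss could use to ignore the unused coordinates; whether UniDex-VLA does this is not stated in the abstract.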