UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, Tianhai Liang, Xiaoyu Tian, Maanping Shao, Feihong Zhang, Mingyu Ding, Yang Gao, Hao Zhao, Hang Zhao, Huazhe Xu

From arXiv; accepted by CVPR 2026.

Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset of over 50K trajectories across eight dexterous hands (6-24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure that aligns fingertip trajectories while preserving plausible hand-object contacts, and we operate on explicit 3D point clouds with human hands masked to narrow the kinematic and visual gaps. Second, we introduce the Function-Actuator-Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Using FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories, enabling human-robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.
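To make the idea of a shared action space concrete, the following is a minimal sketch of how functionally similar actuators on different hands could be mapped to shared coordinates. It is an illustration only, not the FAAS definition from the paper: the slot names in `FAAS_SLOTS`, the `HandSpec` class, the two example hands, and their joint names are all hypothetical, and joint-range normalization is omitted.

```python
from dataclasses import dataclass

# Hypothetical functional slots (assumption: the real FAAS layout is not
# specified in the abstract).
FAAS_SLOTS = [
    "thumb_abduction", "thumb_flexion",
    "index_flexion", "middle_flexion", "ring_flexion", "pinky_flexion",
]
SLOT_INDEX = {name: i for i, name in enumerate(FAAS_SLOTS)}


@dataclass(frozen=True)
class HandSpec:
    """Hypothetical description of one hand: actuator name -> functional slot."""
    name: str
    joint_to_slot: dict

    def encode(self, joints):
        """Write hand-specific joint values into the shared vector.

        Returns (action, mask); mask marks the slots this hand actually drives.
        """
        action = [0.0] * len(FAAS_SLOTS)
        mask = [False] * len(FAAS_SLOTS)
        for joint, slot in self.joint_to_slot.items():
            i = SLOT_INDEX[slot]
            action[i] = joints[joint]
            mask[i] = True
        return action, mask

    def decode(self, action):
        """Read this hand's joint commands back out of the shared vector."""
        return {joint: action[SLOT_INDEX[slot]]
                for joint, slot in self.joint_to_slot.items()}


# Two made-up hands with different actuator counts.
hand_a = HandSpec("hand_a", {
    "th_abd": "thumb_abduction", "th_flex": "thumb_flexion",
    "idx": "index_flexion", "mid": "middle_flexion",
    "rng": "ring_flexion", "pnk": "pinky_flexion",
})
hand_b = HandSpec("hand_b", {
    "thumb": "thumb_flexion",
    "finger_1": "index_flexion", "finger_2": "middle_flexion",
})

# An action recorded on hand_a is expressed in shared coordinates...
shared, mask_a = hand_a.encode({
    "th_abd": 0.1, "th_flex": 0.2, "idx": 0.3,
    "mid": 0.4, "rng": 0.5, "pnk": 0.6,
})
# ...and the same shared vector can be decoded for hand_b.
hand_b_command = hand_b.decode(shared)
```

Under these assumptions, a policy that predicts the shared vector never sees hand-specific joint indices; the per-hand mapping and the mask handle hands with fewer actuators.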
