Planning physically feasible dexterous-hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, forgoing the rich, compositional guidance of open-vocabulary instructions. We introduce UniHM, the first framework for unified dexterous-hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving generalization across dexterous hands and scalability to new morphologies. Our vision-language-action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is at \href{https://unihm.github.io/}{https://unihm.github.io/}.
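The shared-codebook idea above can be illustrated with a minimal sketch: each hand morphology gets its own projection into a common latent space, and all hands quantize against one shared codebook, so heterogeneous hands emit tokens from the same discrete vocabulary. All names and dimensions here (`HAND_DIMS`, `tokenize`, the DoF counts, codebook size) are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch of a shared-codebook tokenizer across hand morphologies.
# Assumption: per-hand linear encoders + one shared codebook; the real
# Unified Hand-Dexterous Tokenizer is a learned model, not this toy.
import numpy as np

rng = np.random.default_rng(0)

HAND_DIMS = {"shadow": 24, "allegro": 16, "leap": 16}  # hypothetical DoF counts
LATENT, CODES = 8, 32

# One projection per morphology, one codebook shared by every hand.
proj = {h: rng.standard_normal((d, LATENT)) for h, d in HAND_DIMS.items()}
codebook = rng.standard_normal((CODES, LATENT))

def tokenize(hand: str, joints: np.ndarray) -> int:
    """Project a joint configuration into the shared latent space and
    return the index of the nearest shared-codebook entry."""
    z = joints @ proj[hand]
    return int(np.argmin(np.linalg.norm(codebook - z, axis=1)))

# Different morphologies map into the same discrete token vocabulary,
# which is what enables cross-hand generalization at the token level.
tok_a = tokenize("shadow", rng.standard_normal(24))
tok_b = tokenize("allegro", rng.standard_normal(16))
```

Because both hands share the token space, a downstream model trained on tokens from one morphology can in principle decode motions for another.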