Hands play a central role in daily life, yet modeling natural hand motion remains underexplored. Existing methods for text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Moreover, contemporary models and their training schemes struggle to achieve animation fidelity and text-motion alignment simultaneously. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences with aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two key innovations: (a) SHIFT, a novel VQ-VAE architecture for tokenizing hand motion, and (b) a geometric refinement stage for finetuning the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) with state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture in-the-wild motion, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE that improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage in which CLUTCH is co-supervised with a reconstruction loss applied directly to the decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on both text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modeling. Code, data, and models will be released.
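To make the tokenization idea concrete, below is a minimal sketch of a part-modality decomposed VQ-VAE tokenizer in the spirit of SHIFT, assuming a PyTorch setup. The module names, stream split (left/right articulation and global wrist motion), and codebook sizes are illustrative assumptions, not the paper's actual architecture; commitment losses and the decoder are omitted for brevity.

```python
# Hypothetical sketch: per-(part, modality) tokenization, NOT the official SHIFT code.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                        # z: (B, T, dim)
        flat = z.reshape(-1, z.shape[-1])        # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)
        idx = dists.argmin(dim=-1)               # one discrete code per frame
        q = self.codebook(idx).view_as(z)
        q = z + (q - z).detach()                 # straight-through gradient
        return q, idx.view(z.shape[:-1])

class DecomposedHandTokenizer(nn.Module):
    """Quantizes each (hand part, modality) stream with its own encoder and
    codebook, so each codebook specializes instead of sharing capacity."""
    def __init__(self, stream_dims: dict, latent_dim: int = 128, num_codes: int = 512):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Linear(d, latent_dim) for name, d in stream_dims.items()
        })
        self.quantizers = nn.ModuleDict({
            name: VectorQuantizer(num_codes, latent_dim) for name in stream_dims
        })

    def forward(self, streams: dict):
        tokens = {}
        for name, x in streams.items():          # x: (B, T, stream_dims[name])
            z = self.encoders[name](x)
            _, tokens[name] = self.quantizers[name](z)
        return tokens                            # per-stream discrete token grids

# Example with assumed streams: per-hand articulation plus global wrist motion.
streams = {
    "left_pose":  torch.randn(2, 64, 45),    # e.g. per-joint rotation parameters
    "right_pose": torch.randn(2, 64, 45),
    "wrist_traj": torch.randn(2, 64, 12),    # global translation/orientation
}
tok = DecomposedHandTokenizer({k: v.shape[-1] for k, v in streams.items()})
print({k: v.shape for k, v in tok(streams).items()})  # each stream: (2, 64)
```

One plausible motivation for this decomposition is that rare in-the-wild poses of one part then cannot pollute the codes learned for another, which is consistent with the generalization and reconstruction gains the abstract attributes to SHIFT.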
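The geometric refinement stage can likewise be sketched as a co-supervised objective: standard next-token cross-entropy on motion tokens, plus a reconstruction loss on the hand parameters decoded from the predicted tokens. The sketch below assumes a PyTorch LLM emitting motion-token logits and a decoder that maps (soft) codebook weights back to continuous parameters via a straight-through path; `llm`, `decoder`, and `lambda_geo` are hypothetical placeholders, not the paper's API.

```python
# Hypothetical sketch of co-supervision with a geometric reconstruction loss.
import torch
import torch.nn.functional as F

def refinement_loss(llm, decoder, text_ids, motion_tokens, gt_params, lambda_geo=1.0):
    """Token cross-entropy plus a reconstruction loss applied directly to the
    decoded hand motion parameters, as described in the abstract.

    motion_tokens: (B, T) discrete motion tokens (teacher forcing).
    gt_params:     (B, T, param_dim) ground-truth hand parameters per frame.
    """
    logits = llm(text_ids, motion_tokens[:, :-1])      # (B, T-1, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2),       # (B, vocab, T-1)
                         motion_tokens[:, 1:])

    # Decode predicted tokens back to continuous parameters; the straight-
    # through trick keeps the decoder path differentiable w.r.t. the logits.
    probs = logits.softmax(dim=-1)
    hard = F.one_hot(probs.argmax(-1), probs.shape[-1]).float()
    code_weights = hard + probs - probs.detach()       # forward: hard, grad: soft
    pred_params = decoder(code_weights)                # (B, T-1, param_dim)
    geo = F.mse_loss(pred_params, gt_params[:, 1:])

    return ce + lambda_geo * geo
```

The key design point is that the token-level objective alone cannot distinguish near-miss codes from geometrically distant ones, whereas the added reconstruction term penalizes errors in the space of decoded hand motion itself.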