We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. Recent work using foundation approaches has shown that increasing the quantity and diversity of training data can markedly improve the performance and robustness of hand pose estimation. However, existing real-world datasets for this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when the architecture and training scheme are kept fixed. Moreover, models trained with AnyHand generalize better to the out-of-domain HO-Cap dataset without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, demonstrating both the benefit of depth integration and the effectiveness of our synthetic data.