Despite remarkable progress in image generation models, generating realistic hands remains a persistent challenge due to their complex articulation, varying viewpoints, and frequent occlusions. We present FoundHand, a large-scale domain-specific diffusion model for synthesizing single- and dual-hand images. To train our model, we introduce FoundHand-10M, a large-scale hand dataset with 2D keypoint and segmentation mask annotations. Our key insight is to use 2D hand keypoints as a universal representation that encodes both hand articulation and camera viewpoint. FoundHand learns from image pairs to capture physically plausible hand articulations, natively enables precise control through 2D keypoints, and supports appearance control. Its core capabilities include reposing hands, transferring hand appearance, and even synthesizing novel views. These in turn enable zero-shot applications such as fixing malformed hands in previously generated images and synthesizing hand video sequences. Extensive experiments and evaluations demonstrate the state-of-the-art performance of our method.