Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words is challenging, owing to the scarcity of real-world 3D sign-word data, the complex nuances of signing motions, and the difficulty of cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, the approach incorporates a curriculum learning strategy to enhance the model's robustness and generalization, yielding more realistic motions. Furthermore, we contribute the ASL3DWord dataset, which provides 3D joint-rotation data for the body, hands, and face for individual sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page.
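To make the architectural claim concrete, the following is a minimal sketch of a transformer-based conditional variational autoencoder for motion sequences, in PyTorch. It is an illustration of the general technique (class conditioning via an embedding, distribution tokens for the latent posterior, reparameterized sampling), not the SignAvatar implementation; all names (`MotionCVAE`, dimensions, layer counts) are hypothetical.

```python
import torch
import torch.nn as nn

class MotionCVAE(nn.Module):
    """Sketch of a transformer-based conditional VAE for pose sequences.

    Hypothetical dimensions: each frame is n_joints joints with feat_dim
    rotation features (e.g. a 6D rotation representation), and the
    condition is a sign-word class label.
    """

    def __init__(self, n_joints=55, feat_dim=6, d_model=128,
                 n_classes=100, max_len=64):
        super().__init__()
        self.input_dim = n_joints * feat_dim
        self.proj_in = nn.Linear(self.input_dim, d_model)
        self.cond_emb = nn.Embedding(n_classes, d_model)
        # Two learnable tokens whose encoder outputs are read off as
        # the posterior mean and log-variance.
        self.dist_tokens = nn.Parameter(torch.randn(2, d_model))
        # Learned positional queries for the decoder (one per frame).
        self.query_pos = nn.Parameter(torch.randn(max_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.proj_out = nn.Linear(d_model, self.input_dim)

    def encode(self, motion, label):
        # motion: (B, T, n_joints * feat_dim); label: (B,)
        b = motion.size(0)
        x = self.proj_in(motion) + self.cond_emb(label).unsqueeze(1)
        tokens = self.dist_tokens.unsqueeze(0).expand(b, -1, -1)
        h = self.encoder(torch.cat([tokens, x], dim=1))
        mu, logvar = h[:, 0], h[:, 1]
        return mu, logvar

    def decode(self, z, label, seq_len):
        # Condition the latent on the class, then cross-attend from
        # per-frame positional queries to reconstruct the sequence.
        b = z.size(0)
        queries = self.query_pos[:seq_len].unsqueeze(0).expand(b, -1, -1)
        memory = (z + self.cond_emb(label)).unsqueeze(1)
        return self.proj_out(self.decoder(queries, memory))

    def forward(self, motion, label):
        mu, logvar = self.encode(motion, label)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.decode(z, label, motion.size(1))
        return recon, mu, logvar
```

At generation time one would skip `encode`, sample `z` from the prior, and call `decode` with the desired word label, which is what allows the same model to both reconstruct and generate.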