Language-guided, scene-aware human motion generation is important for entertainment and robotics. To address the limitations of existing datasets, we introduce LaserHuman, a pioneering dataset designed to advance Scene-Text-to-Motion research. LaserHuman stands out for its real human motions captured in 3D environments, unbounded free-form natural language descriptions, a mix of indoor and outdoor scenarios, and dynamic, ever-changing scenes. The diverse modalities of captured data and rich annotations present great opportunities for research on conditional motion generation and can also facilitate the development of real-life applications. Moreover, to generate semantically consistent and physically plausible human motions, we propose a simple but effective multi-conditional diffusion model that achieves state-of-the-art performance on existing datasets.