Generating natural and expressive human motions from textual descriptions is challenging because it requires coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO (Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis), a novel model that integrates part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation and gives finer control over individual body parts, without requiring the ground-truth motion length. To relax the interdependency among body parts introduced by this integration, we devise the Partial Occlusion technique, which probabilistically occludes the information of certain body parts during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks, which synthesize motion from partially generated motion sequences and textual descriptions. These results demonstrate BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
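The Partial Occlusion idea described in the abstract (probabilistically hiding some body-part information during training) can be illustrated with a minimal sketch. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: the part names in `PARTS`, the function `occlude_parts`, the parameter `p_occlude`, the independent per-part coin flip, and the choice to replace an occluded part with zeros. It should not be read as BiPO's actual implementation.

```python
import random

# Hypothetical body-part split; the abstract does not list the parts used.
PARTS = ["root", "backbone", "left_arm", "right_arm", "left_leg", "right_leg"]


def occlude_parts(part_features, p_occlude, rng):
    """Return (occluded_copy, mask) for a dict mapping part name -> frames.

    Each part is independently occluded with probability p_occlude
    (an assumed scheme). An occluded part has every frame replaced by
    zeros of the same length; other parts are copied unchanged.
    mask[name] is True when that part was occluded.
    """
    occluded = {}
    mask = {}
    for name, frames in part_features.items():
        drop = rng.random() < p_occlude
        mask[name] = drop
        if drop:
            occluded[name] = [[0.0] * len(frame) for frame in frames]
        else:
            occluded[name] = [list(frame) for frame in frames]
    return occluded, mask


# Toy motion: two frames per part, two features per frame.
rng = random.Random(0)
motion = {name: [[1.0, 2.0], [3.0, 4.0]] for name in PARTS}
occluded, mask = occlude_parts(motion, p_occlude=0.5, rng=rng)
print([name for name, dropped in mask.items() if dropped])
```

In a training loop, a step like this would be applied to the conditioning input of each part's generator, so that no part can rely on always seeing every other part. At inference time the occlusion would simply be skipped (`p_occlude=0.0`).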