Generating natural and expressive human motion from textual descriptions is challenging because it requires coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO (Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis), a novel model that integrates part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future context during generation and provides finer control over individual body parts, without requiring the ground-truth motion length. To relax the interdependency among body parts that this integration introduces, we devise the Partial Occlusion technique, which probabilistically occludes certain motion part information during training. In comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in text-to-motion generation but also in motion editing tasks, which synthesize motion from partially generated motion sequences and textual descriptions. These results demonstrate BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
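As a rough illustration of the Partial Occlusion idea, the sketch below hides information from other body parts with some probability while one part is being generated. The abstract does not give the actual mechanism, so everything here is an assumption made for illustration: the part names in `PARTS`, the `MASK` placeholder, the `p_occlude` parameter, and the choice to occlude each non-target part independently.

```python
import random

# Hypothetical body-part split; the abstract does not list the parts.
PARTS = ["root", "backbone", "left_arm", "right_arm", "left_leg", "right_leg"]

# Placeholder standing in for whatever "occluded" representation a model would use.
MASK = None


def occlude_parts(part_context, target_part, p_occlude, rng):
    """Return the context visible while generating `target_part`.

    Each *other* part's information is independently replaced by MASK
    with probability `p_occlude`; the target part's own context is kept.
    This is an assumed scheme, not the paper's confirmed procedure.
    """
    visible = {}
    for name, info in part_context.items():
        if name != target_part and rng.random() < p_occlude:
            visible[name] = MASK  # this part is occluded for this training step
        else:
            visible[name] = info  # this part stays visible
    return visible


# Usage: draw one random occlusion pattern for the left arm.
rng = random.Random(0)
context = {name: [0.0, 1.0] for name in PARTS}
visible = occlude_parts(context, "left_arm", p_occlude=0.5, rng=rng)
```

Under this assumed scheme, a part is trained sometimes with and sometimes without access to the others, which is one plausible way to loosen the dependency between parts that the abstract mentions.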