Composing simple elements into complex concepts is crucial yet challenging, especially for 3D action generation. Existing methods largely rely on extensive natural language annotations to discern composable latent semantics, a process that is costly and labor-intensive. In this study, we introduce a novel framework that generates compositional actions without relying on language auxiliaries. Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement. Action Coupling uses an energy model to extract an attention mask for each sub-action, then integrates two actions using these masks to generate pseudo-training examples. Next, we employ a conditional generative model, a conditional variational autoencoder (CVAE), to learn a latent space that enables diverse generation. Finally, we propose Decoupling Refinement, which leverages a self-supervised pre-trained masked autoencoder (MAE) to ensure semantic consistency between the sub-actions and the compositional actions. This refinement process renders the generated 3D actions into 2D space, decouples the rendered images into two sub-segments, uses the MAE to restore the complete image from each sub-segment, and constrains the recovered images to match images rendered from the raw sub-actions. Because no existing dataset contains both sub-actions and compositional actions, we created two new datasets, HumanAct-C and UESTC-C, and present a corresponding evaluation metric. Both qualitative and quantitative assessments are conducted to demonstrate the efficacy of the framework.
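The Action Coupling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `couple_actions`, the `(T, J, 3)` pose layout, and the per-joint mask format are assumptions made for illustration, and the masks are taken as given rather than extracted by an energy model.

```python
import numpy as np

def couple_actions(action_a, action_b, mask_a, mask_b, eps=1e-8):
    """Blend two pose sequences into one pseudo-training example.

    action_a, action_b: arrays of shape (T, J, 3), i.e. T frames, J joints, xyz.
    mask_a, mask_b: non-negative per-joint weights of shape (J,), standing in
    for the attention masks of the two sub-actions (assumed to be given).
    """
    if action_a.shape != action_b.shape:
        raise ValueError("both actions must share the same (T, J, 3) shape")
    # Normalize so the two weights for each joint sum to approximately 1.
    total = mask_a + mask_b + eps
    w_a = (mask_a / total)[None, :, None]  # broadcast over frames and xyz
    w_b = (mask_b / total)[None, :, None]
    return w_a * action_a + w_b * action_b

# Hypothetical usage: joint 0 follows action A, joint 1 follows action B,
# and joint 2 is an even mix of the two.
a = np.zeros((4, 3, 3))
b = np.ones((4, 3, 3))
mixed = couple_actions(a, b, np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5]))
```

A soft per-joint weighting is used here because it degrades gracefully when both masks attend to the same joint; a hard selection of the larger mask value would be an equally plausible reading of the abstract.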