While much effort has been invested in generating human motion from text, relatively few studies have addressed the reverse direction: generating text from motion. Most of this research focuses on maximizing generation quality with no regard for the interpretability of the architectures, in particular the influence of individual body parts on the generation and the temporal synchronization of words with specific movements and actions. This study explores combining movement encoders with spatio-temporal attention models and proposes strategies to guide the attention during training so that it highlights perceptually pertinent areas of the skeleton over time. We show that adding guided attention with an adaptive gate yields interpretable captioning while improving performance over non-interpretable SOTA systems with higher parameter counts. On the KIT MLD dataset, we obtain a BLEU@4 of 24.4% (SOTA +6%), a ROUGE-L of 58.30% (SOTA +14.1%), a CIDEr of 112.10 (SOTA +32.6), and a BERTScore of 41.20% (SOTA +18.20%). On HumanML3D, we obtain a BLEU@4 of 25.00% (SOTA +2.7%), a ROUGE-L of 55.4% (SOTA +6.1%), a CIDEr of 61.6 (SOTA -10.9%), and a BERTScore of 40.3% (SOTA +2.5%). Our code implementation and reproduction details will soon be available at https://github.com/rd20karim/M2T-Interpretable/tree/main.