There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, no significant work has addressed which evaluation metric is appropriate. Human evaluation is the ultimate measure of accuracy for this task, so automated metrics should correlate well with human quality judgments. Because a single description is compatible with many motions, choosing the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task shows even a moderate correlation with human judgments at the sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and the less frequently used coordinate errors show strong correlations. In addition, we do not recommend several recently developed metrics, because their correlation with human judgments is low compared to the alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results show that this new metric offers substantial benefits over all current alternatives.
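The distinction between sample-level and model-level correlation can be illustrated with a minimal sketch. It assumes Pearson correlation as the statistic (the abstract does not say which correlation measure is used), and the record layout, function names, and toy numbers are all hypothetical, chosen only to show how a metric can track model averages well while tracking individual samples poorly.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed statistic)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def sample_level_correlation(records):
    """Correlate metric scores with human ratings over individual samples."""
    return pearson([m for _, m, _ in records], [h for _, _, h in records])


def model_level_correlation(records):
    """Average scores per model first, then correlate the per-model means."""
    by_model = defaultdict(list)
    for model, metric, human in records:
        by_model[model].append((metric, human))
    metric_means, human_means = [], []
    for pairs in by_model.values():
        metric_means.append(sum(m for m, _ in pairs) / len(pairs))
        human_means.append(sum(h for _, h in pairs) / len(pairs))
    return pearson(metric_means, human_means)


# Hypothetical toy data: (model name, automated metric score, human rating).
records = [
    ("A", 0.2, 3.0), ("A", 0.4, 1.0),
    ("B", 0.5, 4.0), ("B", 0.7, 2.0),
    ("C", 0.8, 5.0), ("C", 1.0, 3.0),
]

sample_r = sample_level_correlation(records)
model_r = model_level_correlation(records)
print(f"sample-level r = {sample_r:.3f}, model-level r = {model_r:.3f}")
```

On this toy data the per-model means line up exactly, so the model-level correlation is 1, while the sample-level correlation is only about 0.29. Averaging removes the within-model noise, which is one way a metric could rank models reliably yet be a weak judge of individual generated motions.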