Is the text-to-motion model robust? Recent advances in text-to-motion models stem primarily from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions are often inconsistent, yielding vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability and establish a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. We then introduce a formal framework to address this issue, which we term the Stable Text-to-Motion Framework (SATO). SATO consists of three modules dedicated to stable attention, stable prediction, and balancing the trade-off between accuracy and robustness, respectively. We present a methodology for constructing a SATO that satisfies both attention and prediction stability. To verify the model's stability, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable under synonym substitutions and other slight perturbations while maintaining high accuracy.
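The synonym-perturbation stability check behind the evaluation can be sketched as follows. This is a minimal illustration, not the paper's method: the `SYNONYMS` table, the `perturb` helper, and the cosine-similarity consistency score are all assumed stand-ins for the actual perturbation dataset and motion-comparison metric.

```python
import math

# Hypothetical synonym table (illustrative only; not the paper's dataset).
SYNONYMS = {"walks": "strolls", "quickly": "rapidly", "person": "human"}

def perturb(text: str) -> str:
    """Replace each word with a synonym when one is available,
    producing a semantically similar prompt."""
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def cosine(a, b):
    """Cosine similarity between two motion feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

prompt = "a person walks quickly forward"
perturbed = perturb(prompt)  # "a human strolls rapidly forward"

# Stand-ins for motion features a model might generate from each prompt;
# a stable model should score near 1.0 on semantically equivalent inputs.
motion_original = [0.9, 0.1, 0.4]
motion_perturbed = [0.88, 0.12, 0.41]

print(perturbed)
print(round(cosine(motion_original, motion_perturbed), 3))
```

In practice the comparison would run over many prompt pairs from the HumanML3D- and KIT-ML-based perturbation sets, flagging cases where near-identical text produces divergent motions.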