On the Efficacy of Text-Based Input Modalities for Action Anticipation

Although the task of anticipating future actions is highly uncertain, information from additional modalities helps to narrow down the plausible action choices. Each modality provides different environmental context for the model to learn from. While previous multi-modal methods leverage information from modalities such as video and audio, we primarily explore how text inputs for actions and objects can also enable more accurate action anticipation. To this end, we propose the Multi-modal Anticipative Transformer (MAT), an attention-based video transformer architecture that jointly learns from multi-modal features and text captions. We train our model in two stages: the model first learns to predict actions in the video clip by aligning with captions, and in the second stage we fine-tune it to predict future actions. Compared to existing methods, MAT has the advantage of learning additional environmental context from two kinds of text inputs: action descriptions during the pre-training stage, and text inputs for detected objects and actions during modality feature fusion. We evaluate on three datasets: EpicKitchens-100, EpicKitchens-55, and EGTEA GAZE+. Through extensive experiments, we assess the effectiveness of the pre-training stage and show that our model outperforms previous methods on all datasets. In addition, we examine the impact of object and action information obtained via text through extensive ablations, and show that text descriptions do indeed aid more effective action anticipation.

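The abstract states that the first training stage aligns video clips with captions but does not specify the alignment objective. As an illustration only, the sketch below assumes a symmetric contrastive loss over a batch of paired clip and caption embeddings, a common choice for video-text alignment; it may differ from what MAT actually uses. The function names (`alignment_loss`, `_diagonal_cross_entropy`), the temperature value, and the toy embeddings are all hypothetical and do not come from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss between paired clip and caption embeddings.

    Assumption: video_emb[i] and text_emb[i] describe the same clip, and all
    other pairs in the batch are treated as negatives.
    """
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Similarity matrix: rows index clips, columns index captions.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    video_to_text = _diagonal_cross_entropy(logits)
    text_to_video = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (video_to_text + text_to_video)


# Toy embeddings: captions that match their clips versus swapped captions.
clips = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

matched = alignment_loss(clips, matched_captions)
swapped = alignment_loss(clips, swapped_captions)
```

Under this assumed objective, `matched` is much smaller than `swapped`, which is the behavior a caption-alignment stage relies on: the loss falls as each clip embedding moves toward its own caption and away from the other captions in the batch.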