Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant past moments based on the upcoming activity and combines this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
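The mechanism described above can be sketched in a minimal, illustrative form. This is not the paper's implementation: the function name `action_guided_attention`, the choice of a single attention head, the use of the last time step as the query position, and the sigmoid gate over a concatenated feature vector are all assumptions made for illustration. The only elements taken from the abstract are that predicted action embeddings act as queries and keys, past frame features act as values, and a gating function fuses the attended context with the current frame embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def action_guided_attention(action_emb, frame_emb, current_frame, W_gate):
    """Illustrative sketch of Action-Guided Attention (AGA).

    action_emb:    (T, d) embeddings of the predicted action sequence,
                   used as both queries and keys (per the abstract).
    frame_emb:     (T, d) past frame embeddings, used as values.
    current_frame: (d,)   embedding of the current frame.
    W_gate:        (d, 2d) hypothetical gate projection (assumed form).
    """
    d = action_emb.shape[-1]
    # Action-to-action scaled dot-product attention: the predicted
    # actions decide which past moments are relevant.
    scores = action_emb @ action_emb.T / np.sqrt(d)        # (T, T)
    attn = softmax(scores, axis=-1)
    # Attention weights select over past frame features (values).
    context = attn @ frame_emb                             # (T, d)
    # Assumption: take the context at the most recent step.
    summary = context[-1]                                  # (d,)
    # Sigmoid gate fusing attended context with the current frame
    # embedding (the exact gate parameterization is an assumption).
    g = 1.0 / (1.0 + np.exp(-(W_gate @ np.concatenate([summary, current_frame]))))
    return g * summary + (1.0 - g) * current_frame         # (d,)
```

A design point worth noting: because the attention scores are computed purely from action embeddings, the resulting attention map can be read off after training as action-to-action dependencies, which is what makes the post-training analysis in the abstract possible.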