We study object interaction anticipation in egocentric videos. This task requires understanding the spatiotemporal context formed by past actions on objects, which we term the action context. We propose TransFusion, a multimodal transformer-based architecture that exploits the representational power of language by summarising the action context. TransFusion uses pre-trained image captioning and vision-language models to extract the action context from past video frames. A multimodal fusion module then processes this action context together with the next video frame to forecast the next object interaction. Our model enables more efficient end-to-end learning, and the large pre-trained language models add common sense and generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model and highlight the benefits of language-based context summaries in a task where vision seems to suffice. On the Ego4D test set, our method outperforms state-of-the-art approaches by 40.4% in relative terms in overall mAP. Video and code are available at https://eth-ait.github.io/transfusion-proj/.
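To make the phrase "in relative terms" concrete, the following minimal sketch shows how a relative mAP improvement is usually computed. The function name `relative_improvement` and both mAP values are hypothetical, chosen only to illustrate the arithmetic; they are not results reported for TransFusion or any baseline.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Return the gain of `ours` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (ours - baseline) / baseline


# Hypothetical mAP values (NOT taken from the paper), used only to show
# how an absolute gain translates into a relative one.
baseline_map = 10.0
ours_map = 14.04

absolute_gain = ours_map - baseline_map  # gain in mAP points
relative_gain = relative_improvement(baseline_map, ours_map)  # gain in percent

print(f"absolute gain: {absolute_gain:.2f} mAP points")
print(f"relative gain: {relative_gain:.1f}%")
```

Under these assumed numbers, a gain of about four mAP points corresponds to a relative improvement of roughly 40%, which is why a relative figure should be read together with the absolute scores it is derived from.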