Affordance-Centric Question-driven Task Completion (AQTC) has been proposed to acquire knowledge from videos and provide users with comprehensive, systematic instructions. However, existing methods have neglected both the need to align spatiotemporal visual and linguistic signals and the crucial interaction information between humans and objects. To address these limitations, we combine large-scale pre-trained vision-language and video-language models, which provide stable and reliable multimodal data and facilitate effective spatiotemporal visual-textual alignment. In addition, we propose a novel hand-object-interaction (HOI) aggregation module that captures human-object interaction information and thereby further improves understanding of the presented scenario. Our method achieved first place in the CVPR'2023 AQTC Challenge, with a Recall@1 score of 78.7\%. The code is available at https://github.com/tomchen-ctj/CVPR23-LOVEU-AQTC.
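For readers unfamiliar with the reported metric, the following is a minimal sketch of how Recall@1 is commonly computed for a multiple-choice ranking task. It assumes each question comes with one score per candidate answer and a single correct candidate index. The function name `recall_at_k` and the toy scores are illustrative assumptions, not taken from the challenge toolkit or the released code.

```python
from typing import Sequence


def recall_at_k(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of questions whose correct candidate is among the k top-scored ones.

    scores[i][j] is the (hypothetical) model score for candidate j of question i.
    labels[i] is the index of the correct candidate for question i.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one question is required")

    hits = 0
    for cand_scores, correct in zip(scores, labels):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(
            range(len(cand_scores)),
            key=lambda j: cand_scores[j],
            reverse=True,
        )
        if correct in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy example: three questions with two candidates each (made-up scores).
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 1, 1]
toy_recall_1 = recall_at_k(toy_scores, toy_labels, k=1)
toy_recall_2 = recall_at_k(toy_scores, toy_labels, k=2)
```

In the toy example the top-scored candidate is correct for two of the three questions, so Recall@1 is 2/3. With k=2 every question counts as a hit because each has only two candidates.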