Affordance-centric Question-driven Task Completion (AQTC) for Egocentric Assistant introduces a new scenario in which AI assistants learn from instructional videos and then give users step-by-step guidance on operating devices. In this paper, we present a solution that enhances video alignment to improve multi-step inference. Specifically, we first use VideoCLIP to generate video-script alignment features. We then ground the question-relevant content in the instructional videos and reweight the multimodal context to emphasize prominent features. Finally, we adopt a GRU to conduct multi-step inference. Comprehensive experiments demonstrate the effectiveness of our method, which secured 2nd place in the CVPR'2023 AQTC challenge. Our code is available at https://github.com/zcfinal/LOVEU-CVPR23-AQTC.
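To make the four stages above concrete, the following is a minimal NumPy sketch of how grounding, reweighting, and GRU-based multi-step inference could fit together. It is an illustration under stated assumptions, not the implementation in the linked repository: random vectors stand in for VideoCLIP features, grounding is assumed to be scaled dot-product attention against the question, reweighting is assumed to be a softmax gate over context vectors, and every name (`ground`, `reweight`, `GRUCell`, `infer`, and all dimensions) is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ground(question, segments):
    """Assumed grounding: attention-pool segment features by question similarity."""
    weights = softmax(segments @ question / np.sqrt(question.shape[-1]))  # (T,)
    return weights @ segments, weights

def reweight(context, w_gate):
    """Assumed reweighting: softmax gate over the stacked context vectors."""
    weights = softmax(context @ w_gate)  # (M,)
    return weights @ context, weights

class GRUCell:
    """A standard GRU cell with randomly initialised (untrained) parameters."""
    def __init__(self, d_in, d_h, rng):
        s = 1.0 / np.sqrt(d_h)
        self.W = rng.uniform(-s, s, (3, d_h, d_in))  # input weights: z, r, n
        self.U = rng.uniform(-s, s, (3, d_h, d_h))   # recurrent weights: z, r, n
        self.b = np.zeros((3, d_h))

    def __call__(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])        # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])        # reset gate
        n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])  # candidate
        return (1.0 - z) * n + z * h

def infer(steps, fused, cell, w_out, h0):
    """At each step, score every candidate answer and carry the best state forward."""
    h, chosen = h0, []
    for candidates in steps:  # candidates: (K, d)
        states = [cell(np.concatenate([fused, c]), h) for c in candidates]
        scores = np.array([w_out @ s for s in states])
        best = int(scores.argmax())
        chosen.append(best)
        h = states[best]
    return chosen

rng = np.random.default_rng(0)
d, d_h = 16, 32
video = rng.normal(size=(10, d))   # placeholder for VideoCLIP video features
script = rng.normal(size=(6, d))   # placeholder for VideoCLIP script features
question = rng.normal(size=d)      # placeholder for the encoded question

v_ctx, v_w = ground(question, video)
s_ctx, s_w = ground(question, script)
fused, ctx_w = reweight(np.stack([v_ctx, s_ctx, question]), rng.normal(size=d))

cell = GRUCell(2 * d, d_h, rng)
steps = [rng.normal(size=(4, d)) for _ in range(3)]  # 3 steps, 4 candidates each
answers = infer(steps, fused, cell, rng.normal(size=d_h), np.zeros(d_h))
print(answers)  # one chosen candidate index per step
```

Because all parameters here are random, the chosen indices carry no meaning; the sketch only shows the data flow, in which the hidden state selected at one step conditions the scoring of candidates at the next.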