Video Question Answering (VideoQA) is an important research direction in artificial intelligence: it requires machines to understand video content and to reason about and answer natural-language questions. Methods based on static relationship reasoning have made progress, but they still recognize and represent static relationships inaccurately, and they do not fully exploit the static relationship information in videos for in-depth reasoning. This paper therefore proposes an intra-type and inter-type message passing reasoning method based on static relationships. The method constructs a dual graph for intra-type message passing and a heterogeneous graph, built from static relationships, for inter-type message passing. The intra-type model captures, in the dual graph, the neighborhood information of question-relevant objects and relationships, and updates the dual graph to obtain intra-type clues for answering the question. The inter-type model captures, in the heterogeneous graph, the neighborhood information of question-relevant objects and relationships of different types, and updates the heterogeneous graph to obtain inter-type clues. Finally, the answer is inferred by combining the intra-type and inter-type clues. Experimental results on the ANetQA and NExT-QA datasets demonstrate the effectiveness of the method.
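The two message passing stages described in the abstract can be illustrated with a minimal sketch. It is not the paper's model. It assumes that node features and the question are plain vectors, that neighbor relevance is a softmax over dot products with the question, and that clues are obtained by mean pooling and combined by concatenation. The names `message_pass`, `readout`, `intra_edges`, and `inter_edges`, along with the toy graph, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def message_pass(features, edges, question):
    """One round of question-guided message passing (illustrative assumption).

    features: dict mapping node name -> feature vector (list of floats)
    edges:    dict mapping node name -> list of neighbor node names
    question: question embedding (list of floats)
    Nodes without neighbors keep their original features.
    """
    updated = {}
    for node, feat in features.items():
        neighbors = edges.get(node, [])
        if not neighbors:
            updated[node] = list(feat)
            continue
        # Weight each neighbor by its relevance to the question.
        weights = softmax([dot(features[n], question) for n in neighbors])
        agg = [
            sum(w * features[n][i] for w, n in zip(weights, neighbors))
            for i in range(len(feat))
        ]
        # Average the node's own feature with the aggregated message.
        updated[node] = [(f + a) / 2 for f, a in zip(feat, agg)]
    return updated


def readout(features):
    """Mean-pool all node features into a single clue vector."""
    nodes = list(features.values())
    return [sum(col) / len(nodes) for col in zip(*nodes)]


# Toy graph (hypothetical): two object nodes and two relationship nodes.
features = {
    "obj:person": [1.0, 0.0],
    "obj:dog": [0.0, 1.0],
    "rel:holding": [1.0, 1.0],
    "rel:next_to": [0.5, 0.0],
}
question = [1.0, 0.0]

# Intra-type edges connect nodes of the same type.
intra_edges = {
    "obj:person": ["obj:dog"],
    "obj:dog": ["obj:person"],
    "rel:holding": ["rel:next_to"],
    "rel:next_to": ["rel:holding"],
}
# Inter-type edges connect objects with relationships.
inter_edges = {
    "obj:person": ["rel:holding", "rel:next_to"],
    "obj:dog": ["rel:holding"],
    "rel:holding": ["obj:person", "obj:dog"],
    "rel:next_to": ["obj:person"],
}

intra_clue = readout(message_pass(features, intra_edges, question))
inter_clue = readout(message_pass(features, inter_edges, question))
combined = intra_clue + inter_clue  # list concatenation of the two clue vectors
```

In this sketch the `combined` vector would be passed to an answer classifier. A full model would use learned projections and attention in place of the fixed dot-product weighting assumed here.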