Unlike conventional visual question answering, video-grounded dialog requires understanding both the dialog history and the video content to generate accurate responses. Existing methods have made progress, but they still struggle to understand complex dialog histories incrementally and to integrate video information. To address this gap, we present an iterative tracking and reasoning strategy that combines a textual encoder, a visual encoder, and a generator. The textual encoder uses a path tracking and aggregation mechanism to extract the parts of the dialog history that matter for interpreting the current question. The visual encoder uses an iterative reasoning network to extract and emphasize key visual cues from the video, improving visual understanding. Finally, the pre-trained GPT-2 model serves as the response generator, combining the enriched textual and visual information into coherent, contextually appropriate answers. Experiments on two well-known datasets demonstrate the effectiveness and adaptability of the proposed design.
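To make the idea of iterative reasoning over video features concrete, the following is a minimal sketch, not the network proposed here. It assumes the question is already encoded as a vector and each video frame as a feature vector of the same size, and it repeats a question-guided attention step a fixed number of times. The function names (`iterative_reasoning`, `softmax`, `dot`), the residual update, and the toy inputs are all hypothetical choices made for illustration; a real model would use learned projections and trained encoders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def iterative_reasoning(query, frames, steps=3):
    """Refine a query vector by repeatedly attending over frame features.

    Illustrative assumption: each step applies scaled dot-product attention
    and adds the attended summary back to the state (a residual update).
    """
    state = list(query)
    weights = []
    for _ in range(steps):
        scale = math.sqrt(len(state))
        # Score every frame against the current reasoning state.
        weights = softmax([dot(state, f) / scale for f in frames])
        # Weighted sum of frame features, one value per dimension.
        attended = [
            sum(w * f[i] for w, f in zip(weights, frames))
            for i in range(len(state))
        ]
        # Residual update: fold the attended evidence into the state.
        state = [s + a for s, a in zip(state, attended)]
    return state, weights


# Toy inputs: three 2-dimensional "frame" features and one "question" vector.
query = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
state, weights = iterative_reasoning(query, frames, steps=3)
```

In this toy run the frame aligned with the query receives the largest attention weight, and each pass sharpens that preference. This is the intuition behind emphasizing key visual cues over several reasoning steps.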