Recently, multimodal large language models (MLLMs) have made significant advances in video understanding tasks. However, their ability to understand unprocessed long videos remains very limited, primarily because of the enormous memory overhead such videos incur. Although existing methods balance memory and information by aggregating frames, they inevitably introduce severe hallucination. To address this issue, this paper constructs a comprehensive hallucination-mitigation pipeline on top of existing MLLMs. Specifically, we use the CLIP score to guide frame sampling with the question, selecting key frames relevant to the question. Then, we inject question information into the queries of the image Q-Former to extract more question-relevant visual features. Finally, during answer generation, we use chain-of-thought and in-context learning techniques to explicitly control the generated answers. Notably, for the breakpoint mode, we found that image understanding models achieve better results than video understanding models, so we aggregate the answers of both types of models through a comparison mechanism. Ultimately, we achieve 84.2\% and 62.9\% accuracy in the global and breakpoint modes, respectively, on the MovieChat dataset, surpassing the official baseline model by 29.1\% and 24.1\%. Moreover, the proposed method won third place in the CVPR LOVEU 2024 Long-Term Video Question Answering Challenge. The code is available at https://github.com/lntzm/CVPR24Track-LongVideo
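To make the first pipeline stage concrete, the following is a minimal sketch of question-guided key-frame selection with CLIP scores, as described above. The checkpoint name, sampling stride, and `top_k` are illustrative assumptions, not the authors' exact configuration; see the linked repository for the actual implementation.

```python
# Sketch: score uniformly pre-sampled frames against the question with CLIP,
# then keep the top-scoring frames (in temporal order) as key frames.
import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_key_frames(video_path: str, question: str,
                      stride: int = 30, top_k: int = 8):
    """Return the top_k frames most relevant to the question by CLIP score."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:  # coarse uniform pre-sampling to bound memory
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()

    inputs = processor(text=[question], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image: one image-text similarity score per sampled frame
        scores = model(**inputs).logits_per_image.squeeze(-1)
    keep = scores.topk(min(top_k, len(frames))).indices.sort().values
    return [frames[i] for i in keep.tolist()]
```

Pre-sampling with a fixed stride before scoring keeps the number of CLIP forward passes bounded for hour-long videos, while the final top-k filter retains only frames relevant to the question.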