Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved the comprehension of multimedia content that combines modalities such as text, images, and video. However, a critical challenge for these models, especially when processing video inputs, is hallucination: erroneous perception or interpretation of the content, particularly at the event level. This study introduces a method to address event-level hallucinations in MLLMs, focusing on temporal understanding of specific events in video content. Our approach uses a framework that extracts event-specific information from both the event query and the provided video and uses it to refine the MLLM's response. We propose a mechanism that decomposes on-demand event queries into iconic actions. We then employ models such as CLIP and BLIP2 to predict the timestamps at which the event occurs. Our evaluation on the Charades-STA dataset demonstrates a significant reduction in temporal hallucinations and an improvement in the quality of event-related responses. This research not only offers a new perspective on a critical limitation of MLLMs but also contributes a quantitative method for evaluating MLLMs on temporal questions.
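As a rough illustration of the timestamp-prediction step, the sketch below scores each video frame against a set of iconic-action embeddings and returns the longest contiguous span of high-scoring frames. It is not the authors' implementation: the abstract does not say how scores are aggregated or turned into timestamps, so the max-over-actions rule, the fixed threshold, and the longest-run heuristic are all assumptions, as are the function names `cosine`, `frame_scores`, and `predict_span`. The embeddings are assumed to come from a CLIP-like image/text encoder and are replaced here by toy 2-D vectors.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def frame_scores(frame_embs: Sequence[Vector],
                 action_embs: Sequence[Vector]) -> List[float]:
    """Score each frame by its best match among the iconic actions.

    Assumption: max over actions; the source does not specify the aggregation.
    """
    return [max(cosine(f, a) for a in action_embs) for f in frame_embs]


def predict_span(scores: Sequence[float], fps: float,
                 threshold: float) -> Optional[Tuple[float, float]]:
    """Return (start_sec, end_sec) of the longest run of frames with
    score >= threshold, or None if no frame qualifies.

    Assumption: a fixed threshold and a longest-run heuristic.
    """
    best: Optional[Tuple[int, int]] = None  # [start, end) frame indices
    start: Optional[int] = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0]):
                best = (start, i)
            start = None
    if best is None:
        return None
    return (best[0] / fps, best[1] / fps)


# Toy example: frames 2-4 match the single iconic action, sampled at 1 fps.
toy_frames = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]]
toy_actions = [[0, 1]]
toy_span = predict_span(frame_scores(toy_frames, toy_actions),
                        fps=1.0, threshold=0.5)
```

In practice the toy vectors would be replaced by real frame and text embeddings, and the threshold would need tuning on held-out data. The predicted span could then be passed back to the MLLM as event-specific context.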