Sora Detector: A Unified Hallucination Detection for Large Text-to-Video Models

Rapid advances in text-to-video (T2V) generative models have enabled the synthesis of high-fidelity video content guided by textual descriptions. Despite this progress, these models remain susceptible to hallucination: they generate content that contradicts the input text, which undermines their reliability and practical deployment. To address this issue, we introduce SoraDetector, a novel unified framework designed to detect hallucinations across diverse large T2V models, including the cutting-edge Sora model. Our framework is built upon a comprehensive analysis of hallucination phenomena, which we categorize by how they manifest in the video content. Leveraging state-of-the-art keyframe extraction techniques and multimodal large language models, SoraDetector first evaluates the consistency between the extracted video content summary and the textual prompt, then constructs static and dynamic knowledge graphs (KGs) from frames to detect hallucinations both within single frames and across frames. SoraDetector provides robust, quantifiable measures of consistency and of static and dynamic hallucination. In addition, we have developed the Sora Detector Agent to automate the hallucination detection process and generate a complete video quality report for each input video. Lastly, we present a novel meta-evaluation benchmark, T2VHaluBench, crafted to facilitate the evaluation of advances in T2V hallucination detection. Through extensive experiments on videos generated by Sora and other large T2V models, we demonstrate the efficacy of our approach in accurately detecting hallucinations. The code and dataset can be accessed via GitHub.

