The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated pipelines based on multimodal LLMs, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they degrade significantly in object-action binding and audio-visual synchronization. BRITE offers the community a reliable, interpretable benchmark that can detect and localize limitations in the next generation of T2V models, especially for off-manifold prompts.