The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they degrade significantly in object-action binding and audio-visual synchronization. BRITE offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts.