Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?

State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an open challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they natively process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Motivated by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation with human judgments comparable to that of traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. However, it underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge also provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
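
As a rough illustration of what agreement between an automatic judge and human raters can mean in practice, the sketch below computes a Spearman rank correlation between two lists of per-sample scores. The abstract does not say which correlation coefficient the study uses, so the choice of Spearman is an assumption, and the function names (`average_ranks`, `pearson`, `spearman`) and the sample scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values equal to values[order[start]].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))


if __name__ == "__main__":
    # Made-up scores for five generated clips on one metric (illustrative only).
    judge = [4.0, 2.5, 3.0, 5.0, 1.0]
    human = [3.5, 2.0, 3.5, 4.5, 1.5]
    print(f"Spearman correlation: {spearman(judge, human):.3f}")
```

A rank-based coefficient is a common choice when the judge and the human raters use different score scales, since only the ordering of samples matters; a study could equally report Pearson or Kendall correlation.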
