Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotation, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to generalize robustly across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling effective transfer to multimodal judgments, e.g., over images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on substantially less textual data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge has broad impact in modalities such as molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.