BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.
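
The failure mode of lexical evaluation described above can be illustrated with a minimal sketch. It is not the evaluation harness used in this work: the `Final answer:` format, the `ANSWER_PATTERN` regular expression, and the `lexical_judge` helper are assumptions introduced purely for illustration. The sketch shows how an extraction-and-exact-match scorer can mark semantically correct responses as wrong when they deviate from the expected format or phrasing.

```python
import re

# Hypothetical answer format that the evaluation prompt is assumed to request.
ANSWER_PATTERN = re.compile(r"Final answer:\s*(.+)", re.IGNORECASE)


def normalize(text: str) -> str:
    """Trim whitespace, drop surrounding '.' and '!' characters, and lowercase."""
    return text.strip().strip(".!").lower()


def lexical_judge(candidate: str, reference: str) -> bool:
    """Return True only if the pattern extracts an answer that exactly
    matches the reference after normalization."""
    match = ANSWER_PATTERN.search(candidate)
    if match is None:
        # No answer in the expected format: scored as incorrect,
        # regardless of what the response actually says.
        return False
    return normalize(match.group(1)) == normalize(reference)


reference = "Paris"
compliant = "France's capital is well known. Final answer: Paris"
noncompliant = "The capital of France is Paris."  # correct, wrong format
verbose = "Final answer: The city of Paris"       # correct, extra wording

for name, candidate in [
    ("compliant", compliant),
    ("noncompliant", noncompliant),
    ("verbose", verbose),
]:
    print(name, lexical_judge(candidate, reference))
```

Under these assumptions, only the first response is accepted, although all three state the correct answer. A semantic judge, whether an LLM or an encoder trained on question-candidate-reference triplets, is meant to close this gap.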
