Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation need not rely on large-scale generative models but can instead leverage latent features from smaller ones. These findings point to a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small-model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
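The core idea of a probing-based judge can be sketched as a lightweight regressor fitted on frozen hidden states. The snippet below is a minimal, hypothetical illustration (INSPECTOR's actual architecture and training details are not specified in the abstract): it simulates hidden-state vectors from a frozen small LM with random features whose "true" quality score is a latent linear function, then trains a linear probe by stochastic gradient descent to recover that score without any decoding.

```python
# Hypothetical sketch of a Representation-as-a-Judge linear probe.
# In practice, `h` would be a hidden state extracted from a frozen small LM;
# here we simulate it so the example is self-contained and runnable.
import random

random.seed(0)
DIM = 16        # toy hidden-state dimensionality
N_TRAIN = 200   # number of (representation, score) training pairs

# Latent scoring direction the probe must recover (stands in for the
# evaluative signal assumed to be encoded in hidden states).
true_w = [random.gauss(0, 1) for _ in range(DIM)]

def make_example():
    h = [random.gauss(0, 1) for _ in range(DIM)]          # simulated hidden state
    score = sum(wi * hi for wi, hi in zip(true_w, h))     # latent quality score
    return h, score + random.gauss(0, 0.1)                # noisy supervision label

train = [make_example() for _ in range(N_TRAIN)]

# Fit the linear probe with plain SGD on squared error.
w = [0.0] * DIM
lr = 0.01
for _ in range(50):
    for h, y in train:
        pred = sum(wi * hi for wi, hi in zip(w, h))
        err = pred - y
        w = [wi - lr * err * hi for wi, hi in zip(w, h)]

# Score a held-out representation with the trained probe (no decoding involved).
test_h, test_y = make_example()
pred = sum(wi * hi for wi, hi in zip(w, test_h))
print(f"probe prediction: {pred:.3f}, reference score: {test_y:.3f}")
```

Because the probe is a single linear map over a frozen representation, evaluation costs one forward pass plus a dot product, which is the efficiency argument behind decoding-free judging.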