Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it also introduces a new reliability question: does a judge evaluate the quality of an answer, or does it also react to the language in which the comparison is presented? We propose Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants. A reliable judge should preserve its preference under label-preserving language transformations and should not prefer a language when two answers are translation-equivalent. We evaluate four API-accessible judges on the full 419-item LLMBar benchmark, producing 13,408 successful pairwise judgments. Across models, Chinese and language-switched presentations induce 10.7--14.4% preference flips relative to English, and all judges achieve their highest accuracy in English. However, translation-equivalent tie probes do not reveal a systematic English preference: most probes are judged as ties, and non-tie decisions more often favor Chinese. We add confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment requires no model training, uses only API calls, and is feasible on modest local hardware.
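The flip and significance statistics mentioned above can be illustrated with a minimal sketch. The abstract does not state which interval or test the authors used, so the Wilson score interval and the exact McNemar test below are assumptions chosen for illustration, as are the function names and the `"A"`/`"B"` preference labels.

```python
from __future__ import annotations

import math
from typing import Sequence


def flip_rate(base: Sequence[str], variant: Sequence[str]) -> float:
    """Fraction of items whose preferred response differs between two conditions.

    `base` holds a judge's preferences under English presentation and
    `variant` the preferences for the same items under another presentation
    (hypothetical labels such as "A" / "B").
    """
    if len(base) != len(variant) or not base:
        raise ValueError("inputs must be non-empty and of equal length")
    flips = sum(b != v for b, v in zip(base, variant))
    return flips / len(base)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact(base_correct: Sequence[bool], variant_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correctness (assumed test)."""
    if len(base_correct) != len(variant_correct):
        raise ValueError("inputs must be of equal length")
    pairs = list(zip(base_correct, variant_correct))
    only_base = sum(b and not v for b, v in pairs)      # correct in base only
    only_variant = sum(v and not b for b, v in pairs)   # correct in variant only
    n = only_base + only_variant
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_base, only_variant)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Under these assumptions, `flip_rate` would be applied per judge to the English preferences and the preferences for the same items under a Chinese or language-switched presentation, `wilson_interval` would bound the resulting proportion, and `mcnemar_exact` would compare per-item correctness between the two conditions.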