Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine differences in model performance and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying the target language. Using synthetic customer-support dialogues generated with identical parameters in Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these related, morphologically rich Finno-Ugric languages. With a small set of Estonian native-speaker annotations as a reference point, we find that ranking instability is systematic but confined to one class of measures: surface-level metrics (lexical diversity, surface and semantic similarity) remain stable across languages, whereas pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect differences in how judge scoring behaves across languages rather than true model differences. The controlled design therefore serves as a diagnostic probe: an evaluation method that fails to remain stable under identical generation conditions signals transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework at https://github.com/isaac-chung/cross-lingual-stability-judges to enable replication across language families.
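
The stability check described in the abstract can be illustrated with a minimal sketch. It assumes that each evaluation method yields one aggregate score per model per language, and it uses Spearman rank correlation between language pairs as the stability measure; the abstract does not name the correlation statistic, so that choice is an assumption. The function names and the placeholder scores are hypothetical and are not taken from the released framework.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / (var_x * var_y) ** 0.5


def rank_stability(scores_by_language):
    """Correlate model scores for every pair of languages.

    scores_by_language maps a language code to a list of scores, one per
    model, with the models in the same order for every language.
    """
    return {
        (a, b): spearman(scores_by_language[a], scores_by_language[b])
        for a, b in combinations(sorted(scores_by_language), 2)
    }


# Placeholder scores for four hypothetical models, invented for illustration.
# They are NOT results from the paper.
example_scores = {
    "et": [0.71, 0.64, 0.58, 0.49],
    "fi": [0.69, 0.66, 0.55, 0.50],
    "hu": [0.52, 0.61, 0.48, 0.63],
}

stability = rank_stability(example_scores)
for pair, rho in stability.items():
    print(pair, round(rho, 2))
```

Under this reading, a metric whose pairwise correlations stay near 1 preserves model rankings across languages, while values near zero or below indicate the kind of rank inversion the abstract reports for pragmatic judgments.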
