Validating evaluation metrics for natural language generation (NLG) typically relies on expensive, time-consuming human annotations, which exist predominantly for English datasets. We propose LLM as a Meta-Judge, a scalable framework that uses LLMs to generate synthetic evaluation datasets through controlled semantic degradation of real data, removing the need for human judgments. We validate the approach with meta-correlation, which measures the agreement between metric rankings derived from synthetic data and those derived from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization show that synthetic validation is a reliable proxy for human judgment, with meta-correlations exceeding 0.9 in multilingual QA, making it a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data are publicly available at https://github.com/eiglerl/meta-judge.
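To make the meta-correlation idea concrete, the following is a minimal sketch, not the implementation from the repository. It assumes each metric has already been assigned one quality score on the synthetic benchmark and one on the human benchmark, and it assumes Spearman rank correlation as the agreement measure; the abstract does not specify which coefficient is used. The function names and the metric names in the usage note are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def meta_correlation(synthetic_quality, human_quality):
    """Agreement between two metric rankings.

    Both arguments map a metric name to a score saying how well that
    metric performs on the respective benchmark (assumed precomputed).
    """
    if set(synthetic_quality) != set(human_quality):
        raise ValueError("both benchmarks must cover the same metrics")
    names = sorted(synthetic_quality)
    return spearman(
        [synthetic_quality[n] for n in names],
        [human_quality[n] for n in names],
    )
```

For example, with made-up scores `{"metric_a": 0.2, "metric_b": 0.5, "metric_c": 0.9}` on synthetic data and `{"metric_a": 0.1, "metric_b": 0.4, "metric_c": 0.7}` on human data, both benchmarks rank the three metrics in the same order, so `meta_correlation` returns 1.0. A value near 1 indicates that the synthetic benchmark would lead to the same choice of metric as the human one.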