We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show that it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
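The ensemble idea above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the evaluator names and the mean-aggregation rule are assumptions standing in for independently prompted LLM judges.

```python
from statistics import mean

def evaluate_with_ensemble(candidate, evaluators):
    """Prompt each evaluator independently (no coordination between them)
    and aggregate their scores; here we simply take the mean."""
    scores = [ev(candidate) for ev in evaluators]
    return mean(scores), scores

# Hypothetical stub judges standing in for separately prompted LLM evaluators;
# one returns a discrete score, the others continuous values.
fluency_judge = lambda text: 4.0
relevance_judge = lambda text: 4.5
coverage_judge = lambda text: 3.5

overall, per_judge = evaluate_with_ensemble(
    "A candidate summary to be judged.",
    [fluency_judge, relevance_judge, coverage_judge],
)
print(overall)  # 4.0
```

Because the judges never see each other's outputs, adding or removing an evaluator changes only the aggregation step, which keeps the scheme scalable.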