Evaluating Large Language Models (LLMs) is a complex task, given the intricacies of natural language understanding and the expectation of high-level reasoning. Traditional evaluation typically relies on human-based, model-based, or automatic-metric-based paradigms, each with its own advantages and shortcomings. We introduce "Fusion-Eval", a system that uses LLMs not only for direct evaluation but also to integrate insights from diverse evaluators. This gives Fusion-Eval the flexibility to work effectively across diverse tasks and to make good use of multiple references. In testing on the SummEval dataset, Fusion-Eval achieved a Spearman correlation of 0.96, outperforming other evaluators. The success of Fusion-Eval underscores the potential of LLMs to produce evaluations that closely align with human perspectives, setting a new standard in the field of LLM evaluation.
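The headline result above is a Spearman rank correlation. As a minimal sketch of how such a number is obtained, the snippet below ranks two score lists (averaging ranks for ties) and computes the Pearson correlation of the ranks. The score values are hypothetical, chosen only for illustration; they are not SummEval data. The abstract does not say whether the correlation is computed per summary or per system, so that choice is left open here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five outputs (illustrative only, not from the paper).
human_scores = [4.5, 3.0, 2.0, 5.0, 1.5]
evaluator_scores = [4.0, 3.5, 1.0, 4.8, 2.2]

if __name__ == "__main__":
    print(f"Spearman correlation: {spearman(human_scores, evaluator_scores):.2f}")
```

Because only the ordering of the scores matters, an evaluator can reach a high Spearman correlation even if its absolute scores sit on a different scale from the human ratings.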