Quantitative evaluation metrics have traditionally been pivotal in gauging the progress of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only as a way to compare and benchmark models and do not yield actionable diagnostics, which makes model improvement challenging. Model developers are left with extensive manual effort: sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that, when applied, accelerate model improvement. The insights are backed by a comprehensive dashboard with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights improves, for example, the performance of the Llama 2 model by up to 15 percentage points over baselines on a challenging dialogue task (DialogSum). QualEval increases the pace of model development, in essence serving as a data-scientist-in-a-box. Given its focus on critiquing and improving current evaluation metrics, our method serves as a refreshingly new technique for both model evaluation and improvement.