QualEval: Qualitative Evaluation for Model Improvement

Quantitative evaluation metrics have traditionally been pivotal in gauging the progress of artificial intelligence systems, including large language models (LLMs). However, these metrics have inherent limitations. Given the intricate nature of real-world tasks, a single scalar is insufficient to capture the fine-grained nuances of model behavior. Metrics serve only to compare and benchmark models; they do not yield actionable diagnostics, which makes model improvement challenging. Model developers are left with extensive manual effort: sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that, when applied, accelerate model improvement. The insights are backed by a comprehensive dashboard with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights improves, for example, the performance of the Llama 2 model by up to 15% relative to baselines on a challenging dialogue task (DialogSum). QualEval increases the pace of model development, in essence serving as a data-scientist-in-a-box. Because it focuses on critiquing and improving current evaluation metrics, our method offers a new technique for both model evaluation and improvement.

