Large language models (LLMs) have demonstrated a significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to periodically assess LLM performance on unlabeled production data to validate the model in a real-world setting. Human labeling to assess model error is costly and slow. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings, based on our evaluation on the keyphrase extraction (KPE) task. We measure the fidelity of the results by comparing them to the true error measured from human-labeled ground truth. We contrast this approach with the alternative of using another LLM as a source of machine labels, or silver labels. Results across various languages and domains show that disagreement scores provide a better estimate of model performance, with mean absolute error (MAE) as low as 0.4% and, on average, 13.8% better than using silver labels.
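The idea of using ensemble disagreement as a stand-in for error can be illustrated with a minimal sketch. The abstract does not define the score precisely, so the following is an assumption: disagreement is taken as the mean of 1 - F1 between the keyphrase set of a primary model and the sets produced by the other ensemble members, and the fidelity check is the mean absolute error between estimated and true error. All function names here are hypothetical.

```python
from statistics import mean


def f1(pred: set[str], ref: set[str]) -> float:
    """Set-level F1 between two keyphrase sets.

    Two empty sets are treated as full agreement (an assumption).
    """
    if not pred and not ref:
        return 1.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def disagreement_score(primary: list[set[str]],
                       others: list[list[set[str]]]) -> float:
    """Estimated error on unlabeled data (assumed definition).

    primary: one keyphrase set per document from the model under test.
    others:  for each other ensemble member, one keyphrase set per document.
    """
    per_doc = [
        mean(1.0 - f1(pred, member[i]) for member in others)
        for i, pred in enumerate(primary)
    ]
    return mean(per_doc)


def true_error(primary: list[set[str]], gold: list[set[str]]) -> float:
    """Error against human-labeled ground truth, as 1 - F1 per document."""
    return mean(1.0 - f1(p, g) for p, g in zip(primary, gold))


def mean_absolute_error(estimates: list[float], truths: list[float]) -> float:
    """MAE between estimated and true error across datasets."""
    return mean(abs(e - t) for e, t in zip(estimates, truths))
```

In this sketch, `disagreement_score` needs no labels, so it could be computed on production data, while `true_error` and `mean_absolute_error` are only needed when validating the proxy against a labeled sample.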