Large language models (LLMs) have demonstrated a strong ability to generalize across a wide range of NLP tasks. For industry applications, it is imperative to periodically assess the performance of an LLM on unlabeled production data to validate it in a real-world setting. Human labeling to assess model error is costly and slow. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings, based on our evaluation on the keyphrase extraction (KPE) task. We measure the fidelity of the results by comparing them to the true error measured against human-labeled ground truth. We contrast this with the alternative of using another LLM as a source of machine labels, or silver labels. Results across various languages and domains show that disagreement scores provide a better estimate of model performance, with mean absolute error (MAE) as low as 0.4% and on average 13.8% better than using silver labels.
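To make the idea concrete, the following is a minimal sketch of how a disagreement score and its MAE against true error might be computed for keyphrase sets. The abstract does not specify the exact formulation, so everything here is an assumption made for illustration: the function names (`f1`, `disagreement_score`, `mean_absolute_error`) are hypothetical, and disagreement is taken to be one minus the mean pairwise F1 between ensemble members' predictions.

```python
from itertools import combinations
from statistics import mean


def f1(pred: set, ref: set) -> float:
    """Set-level F1 between two keyphrase sets."""
    if not pred and not ref:
        return 1.0  # two empty predictions agree perfectly
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def disagreement_score(ensemble_preds: list) -> float:
    """Assumed definition: 1 - mean pairwise F1 across ensemble members.

    ensemble_preds holds one keyphrase set per model for a single document.
    """
    pairs = combinations(ensemble_preds, 2)
    return mean(1.0 - f1(a, b) for a, b in pairs)


def mean_absolute_error(estimated: list, true: list) -> float:
    """MAE between estimated error (e.g. disagreement) and true error."""
    return mean(abs(e - t) for e, t in zip(estimated, true))


# Hypothetical predictions from three ensemble members for one document.
preds = [
    {"ensemble", "disagreement"},
    {"ensemble", "disagreement"},
    {"ensemble", "labeling"},
]
score = disagreement_score(preds)

# Hypothetical per-dataset estimated vs. human-measured error rates.
mae = mean_absolute_error([0.30, 0.20], [0.28, 0.25])
```

In this sketch, a low MAE between the disagreement-based estimate and the human-measured error would indicate that the proxy tracks true error closely; the same `mean_absolute_error` function could be applied to a silver-label estimate for comparison.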