From grading papers to summarizing medical documents, large language models (LLMs) are increasingly used to evaluate text generated by humans and AI alike. Despite this broad utility, LLMs exhibit distinct failure modes, so their text evaluation capabilities need thorough auditing and improvement. Here we introduce ALLURE, a systematic approach to Auditing Large Language Models Understanding and Reasoning Errors. ALLURE compares LLM-generated evaluations with annotated data and iteratively incorporates instances of significant deviation into the evaluator, which leverages in-context learning (ICL) to make its evaluation of text more robust. Through this iterative process, we aim to refine the performance of the evaluator LLM and ultimately reduce the reliance on human annotators in the evaluation process. We anticipate that ALLURE will serve diverse applications of LLMs across domains that involve evaluating textual data, and that it will improve productivity in these fields.
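The audit-and-incorporate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Example`, `audit_round`, `allure_loop`, and `toy_evaluator`, the absolute-difference deviation measure, the `threshold` value, and the per-round cap are all hypothetical. The toy evaluator is a stand-in for a real LLM call that would place the ICL examples in its prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Example:
    """One annotated item: a text and its human-assigned score (hypothetical schema)."""
    text: str
    gold_score: float


# An evaluator maps (text, current ICL examples) to a predicted score.
Evaluator = Callable[[str, List[Example]], float]


def audit_round(evaluate: Evaluator, annotated: List[Example],
                icl_examples: List[Example], threshold: float) -> List[Example]:
    """Return annotated items whose predicted score deviates from gold by more than threshold."""
    failures = []
    for ex in annotated:
        predicted = evaluate(ex.text, icl_examples)
        if abs(predicted - ex.gold_score) > threshold:
            failures.append(ex)
    return failures


def allure_loop(evaluate: Evaluator, annotated: List[Example],
                threshold: float = 1.0, max_rounds: int = 5,
                max_added_per_round: int = 2) -> Tuple[List[Example], List[int]]:
    """Repeatedly audit the evaluator and add its significant failures as ICL examples."""
    icl_examples: List[Example] = []
    history: List[int] = []  # number of failures observed in each round
    for _ in range(max_rounds):
        failures = audit_round(evaluate, annotated, icl_examples, threshold)
        history.append(len(failures))
        new = [ex for ex in failures if ex not in icl_examples][:max_added_per_round]
        if not new:
            break  # no failures left, or nothing new to learn from
        icl_examples.extend(new)
    return icl_examples, history


def toy_evaluator(text: str, icl_examples: List[Example]) -> float:
    """Stand-in for an LLM: reuses the score of a matching ICL example, else guesses 3.0."""
    for ex in icl_examples:
        if ex.text == text:
            return ex.gold_score
    return 3.0


annotated = [Example("a", 3.0), Example("b", 1.0), Example("c", 5.0), Example("d", 3.5)]
icl, history = allure_loop(toy_evaluator, annotated, threshold=1.0)
print([ex.text for ex in icl], history)
```

In this toy run the evaluator is audited on the same items that are later added to its prompt, so the drop in failures is trivial. A realistic setup would presumably measure improvement on a held-out annotated split.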