Recent advances in the language modeling and emergent capabilities of large language models (LLMs) make them promising reference-free evaluators of natural language generation quality and a competent alternative to human evaluation. However, because these models are often closed-source or computationally expensive to host and tune, there has been little work on further calibrating an off-the-shelf LLM-based evaluator toward better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator with human preference. Instead of explicitly modeling human preferences, we first encompass them implicitly within a set of human labels. Then, the language model itself drafts an initial set of scoring criteria, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets show a significant improvement in correlation with expert evaluation through calibration. Our comprehensive qualitative analysis offers intuitions and observations on what makes scoring criteria effective.
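The draft, select, and refine stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `draft`, `score`, and `refine` are hypothetical stand-ins for LLM prompts, the parameters `n_drafts`, `shots`, and `top_k` are invented for the example, and Spearman rank correlation is assumed as the agreement measure (the abstract only says "correlation with expert evaluation").

```python
import random
from typing import Callable, List, Sequence, Tuple

Labeled = Tuple[str, float]  # (generated text, human score)


def _ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant scorer carries no ranking information
    return cov / (vx * vy) ** 0.5


def calibrate(
    labeled: List[Labeled],
    draft: Callable[[List[Labeled]], str],        # hypothetical: few-shot examples -> criteria
    score: Callable[[str, str], float],           # hypothetical: (criteria, text) -> LLM score
    refine: Callable[[str, List[Labeled]], str],  # hypothetical: criteria -> refined criteria
    n_drafts: int = 4,
    shots: int = 3,
    top_k: int = 2,
    seed: int = 0,
) -> str:
    """Return the candidate criteria whose scores best track the human labels."""
    rng = random.Random(seed)
    texts = [t for t, _ in labeled]
    human = [h for _, h in labeled]

    def agreement(criteria: str) -> float:
        return spearman([score(criteria, t) for t in texts], human)

    # Stage 1: draft criteria from different few-shot subsets of the labels.
    candidates = [draft(rng.sample(labeled, shots)) for _ in range(n_drafts)]
    # Stage 2: keep the candidates that agree best with the human labels.
    best = sorted(candidates, key=agreement, reverse=True)[:top_k]
    # Stage 3: refine the survivors, then re-evaluate originals and refinements.
    pool = best + [refine(c, rng.sample(labeled, shots)) for c in best]
    return max(pool, key=agreement)
```

The procedure only calls the model and compares its scores against human labels, so no gradients or model weights are needed, which is what makes such an approach usable with closed-source evaluators. This sketch scores candidates on the same labeled set it drafts from for brevity; a held-out split would be the more careful choice.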