Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, these models often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited-data scenarios, which are common in many real-world problems. Our work presents a data augmentation technique that selects more effective samples from limited data in order to align an open LLM with human preference. On the mathematical reasoning evaluation task, our approach achieves approximately a 7% improvement in Pearson correlation with a reference judge over the baseline, and a 30% improvement over the base model (Llama3.1-8B-Instruct), demonstrating that selecting more effective preference data through augmentation enables our approach to surpass baseline methods.
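The reported metric is the Pearson correlation between the trained judge's scores and a reference judge's scores. As a minimal sketch of how that agreement is computed, the snippet below implements the standard Pearson formula; the score vectors are illustrative placeholders, not data from this work.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 ratings of the same responses by two judges.
reference_scores = [4, 2, 5, 3, 1, 4]  # reference (e.g. human) judge
model_scores     = [4, 3, 5, 2, 1, 4]  # open-LLM judge being evaluated
print(round(pearson(reference_scores, model_scores), 3))
```

A correlation near 1 indicates the LLM judge ranks responses consistently with the reference judge; in practice `scipy.stats.pearsonr` offers the same computation plus a p-value.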