Evaluating the quality of machine-generated natural language content is a challenging task in Natural Language Processing (NLP). Recently, large language models (LLMs) like GPT-4 have been employed for this purpose, but they are computationally expensive due to the extensive token usage required by complex evaluation prompts. In this paper, we propose a prompt optimization approach that uses a smaller, fine-tuned language model to compress the input data of the evaluation prompt, thus reducing token usage and computational cost when using larger LLMs for downstream evaluation. Our method involves a two-stage fine-tuning process: supervised fine-tuning followed by preference optimization to refine the model's outputs based on human preferences. We focus on Machine Translation (MT) evaluation and use the GEMBA-MQM metric as a starting point. Our results show a $2.37\times$ reduction in token usage without any loss in evaluation quality. This work makes state-of-the-art LLM-based metrics like GEMBA-MQM more cost-effective and efficient, enhancing their accessibility for broader use.