Recent advancements in Large Language Model (LLM)-based Natural Language Generation evaluation have largely focused on single-example prompting, which incurs substantial token overhead and computational inefficiency. In this work, we introduce BatchGEMBA-MQM, a framework that integrates batched prompting with the GEMBA-MQM metric for machine translation evaluation. Our approach aggregates multiple translation examples into a single prompt, reducing token usage by a factor of 2-4 (depending on batch size) relative to single-example prompting. Furthermore, we propose a batching-aware prompt compression model that achieves an additional token reduction of 13-15% on average while also helping to mitigate batching-induced quality degradation. Evaluations across several LLMs (GPT-4o, GPT-4o-mini, Mistral Small, Phi4, and CommandR7B) and varying batch sizes reveal that although batching generally degrades evaluation quality (sometimes only modestly), prompt compression does not degrade quality further and in some cases even recovers part of the loss. For instance, GPT-4o retains over 90% of its baseline performance at a batch size of 4 when compression is applied, compared to a 44.6% drop without compression. We plan to release our code and trained models at https://github.com/NL2G/batchgemba to support future research in this domain.
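The token savings from batched prompting come from sharing a single instruction block across many examples instead of repeating it per call. The following minimal sketch illustrates this idea; all names and the prompt template are illustrative assumptions, not the paper's actual GEMBA-MQM prompt.

```python
# Minimal sketch of the batched-prompting idea: instructions are sent once
# per batch rather than once per example. Template text is hypothetical.

INSTRUCTIONS = (
    "You are an MQM annotator. For each source/translation pair, "
    "list error spans with category and severity."
)

def single_prompt(src: str, hyp: str) -> str:
    """One prompt per example: instructions repeated on every call."""
    return f"{INSTRUCTIONS}\n\nSource: {src}\nTranslation: {hyp}"

def batched_prompt(pairs) -> str:
    """One prompt for a whole batch: instructions appear only once."""
    body = "\n\n".join(
        f"Example {i}:\nSource: {s}\nTranslation: {h}"
        for i, (s, h) in enumerate(pairs, 1)
    )
    return f"{INSTRUCTIONS}\n\n{body}"

pairs = [
    ("Guten Morgen.", "Good morning."),
    ("Wie geht es dir?", "How are you?"),
    ("Danke schön.", "Thank you very much."),
    ("Bis später.", "See you later."),
]

# Crude whitespace-token proxy for the savings from sharing instructions;
# a real tokenizer would give different absolute counts but the same trend.
single_tokens = sum(len(single_prompt(s, h).split()) for s, h in pairs)
batch_tokens = len(batched_prompt(pairs).split())
print(single_tokens, batch_tokens)
```

With a batch of four, the fixed instruction cost is amortized over all examples, which is where the reported 2-4x reduction (growing with batch size) comes from.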