Open-domain text generation models powered by large pre-trained language models (LLMs) have recently achieved remarkable performance. However, evaluating and controlling these models for desired attributes remains a challenge, as traditional reference-based metrics such as BLEU, ROUGE, and METEOR are insufficient for open-ended generation tasks. Similarly, while trainable discriminator-based evaluation metrics show promise, obtaining high-quality training data for them is non-trivial. In this paper, we introduce a novel approach to evaluating open-domain generation: Meta-Distribution Methods (MDM). Drawing on the correlation between rising parameter counts and improving performance in LLMs, MDM creates a mapping from the contrast between two probability distributions (one known to be superior to the other) to quality measures. This mapping can be viewed as a distribution of distributions, i.e., a Meta-Distribution. We investigate MDM for open-domain text generation evaluation under two paradigms: 1) \emph{Generative} MDM, which leverages Meta-Distribution Methods to generate in-domain negative samples for training discriminator-based metrics; 2) \emph{Discriminative} MDM, which directly uses distribution discrepancies between two language models for evaluation. Our experiments on multi-turn dialogue and on factuality in abstractive summarization demonstrate that MDMs correlate better with human judgment than existing automatic evaluation metrics on both tasks, highlighting the strong performance and generalizability of such methods.
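As a rough illustration of the discriminative idea (scoring text by the discrepancy between a stronger and a weaker model), the following minimal sketch contrasts per-token log-probabilities under two toy unigram models. It is not the formulation used in the paper: the models `STRONG_LM` and `WEAK_LM`, the function names, and the choice of a length-normalized log-likelihood difference as the contrast are all assumptions made for illustration only.

```python
import math
from typing import Dict, List

# Hypothetical toy "language models": unigram token probabilities.
# In practice these would be a larger and a smaller pre-trained LM.
STRONG_LM: Dict[str, float] = {"the": 0.4, "cat": 0.3, "sat": 0.2, "zzz": 0.1}
WEAK_LM: Dict[str, float] = {"the": 0.25, "cat": 0.25, "sat": 0.25, "zzz": 0.25}


def token_log_probs(model: Dict[str, float], tokens: List[str]) -> List[float]:
    """Return the log-probability of each token under a unigram model."""
    return [math.log(model[token]) for token in tokens]


def contrast_score(
    strong: Dict[str, float], weak: Dict[str, float], tokens: List[str]
) -> float:
    """Length-normalized log-likelihood difference between two models.

    Assumption: a positive score means the stronger model finds the text
    more likely than the weaker one, which is treated here as a proxy
    for quality.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    strong_lp = token_log_probs(strong, tokens)
    weak_lp = token_log_probs(weak, tokens)
    return sum(s - w for s, w in zip(strong_lp, weak_lp)) / len(tokens)


if __name__ == "__main__":
    fluent = ["the", "cat", "sat"]
    noisy = ["zzz", "zzz"]
    print(contrast_score(STRONG_LM, WEAK_LM, fluent))
    print(contrast_score(STRONG_LM, WEAK_LM, noisy))
```

Under these toy distributions the first sequence receives a positive score and the second a negative one. With real models, the per-token log-probabilities would come from each model's conditional distribution over the next token rather than from a fixed unigram table.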