Summarization models often generate text that is poorly calibrated to quality metrics because they are trained with maximum likelihood estimation (MLE) to maximize the likelihood of a single reference. To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on how to generate and optimize these sets; less is known about why one setup is more effective than another. In this work, we uncover the underlying characteristics of effective sets. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets a distinct aspect of the sets, such as lexical diversity or the size of the gap between positives and negatives. On three diverse scientific long-form summarization datasets (spanning biomedical, clinical, and chemical domains), we find, among other results, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between candidates should be maximized and surprise (the disagreement between model-defined and metric-defined candidate rankings) minimized. Code to create, select, and optimize calibration sets is available at https://github.com/griff4692/calibrating-summaries
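To make the two relevance-calibration quantities concrete, the following is a minimal sketch of how metric margin and surprise could be computed for one candidate set. The abstract does not give formulas, so both definitions here are assumptions chosen for illustration: margin is taken as the average gap between metric-adjacent candidates, and surprise as the mean absolute difference between model-defined and metric-defined ranks. The function names `ranks`, `metric_margin`, and `surprise` are hypothetical and are not taken from the released code.

```python
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank of each item, 0 = highest score; ties are broken by position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def metric_margin(metric_scores: Sequence[float]) -> float:
    """Assumed definition: average gap between metric-adjacent candidates."""
    if len(metric_scores) < 2:
        raise ValueError("need at least two candidates to compute a margin")
    ordered = sorted(metric_scores, reverse=True)
    gaps = [a - b for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def surprise(model_scores: Sequence[float], metric_scores: Sequence[float]) -> float:
    """Assumed definition: mean absolute difference between the rank a
    candidate gets from the model (e.g. log-likelihood) and from the metric."""
    if len(model_scores) != len(metric_scores):
        raise ValueError("both score lists must cover the same candidates")
    model_ranks = ranks(model_scores)
    metric_ranks = ranks(metric_scores)
    diffs = [abs(m - q) for m, q in zip(model_ranks, metric_ranks)]
    return sum(diffs) / len(diffs)


# Illustrative numbers only: four candidate summaries for one training instance.
metric = [0.9, 0.6, 0.3, 0.1]        # e.g. a relevance metric per candidate
model_ll = [-1.2, -0.8, -2.0, -3.5]  # e.g. model log-likelihood per candidate

margin_value = metric_margin(metric)
surprise_value = surprise(model_ll, metric)
```

Under these assumed definitions, a selection strategy in the spirit of the relevance finding would prefer candidate subsets with a larger `margin_value` and a smaller `surprise_value`. A rank correlation such as Spearman's could be substituted for the rank-difference measure without changing the idea.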