Self-consistency has emerged as a powerful method for improving the accuracy of short answers generated by large language models. As previously defined, it concerns only the accuracy of a final answer parsed from generated text. In this work, we extend the idea to open-response generation by integrating voting into the decoding method. Each output sentence is selected from among multiple samples, conditioned on the previous selections, using a simple token-overlap score. We compare this "Sample & Select" method to greedy decoding, beam search, nucleus sampling, and the recently introduced hallucination-avoiding decoders DoLA, P-CRR, and S-CRR. We show that Sample & Select improves factuality by a 30% relative margin over these decoders in NLI-based evaluation on the subsets of CNN/DM and XSum used in the FRANK benchmark, while maintaining comparable ROUGE-1 F1 scores against reference summaries. We also collect human verifications of the generated summaries, which confirm the factual superiority of our method.
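The sentence-level voting described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not specify the exact overlap score, so the sketch assumes each candidate sentence is scored by its mean Jaccard token overlap with the other candidates. The names `sample_sentence`, `select_by_overlap`, and `sample_and_select` are hypothetical, and `sample_sentence` stands in for any function that samples one next sentence from a language model given the text so far.

```python
import re
from typing import Callable, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens of a sentence (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_by_overlap(candidates: List[str]) -> str:
    """Return the candidate that agrees most with the others.

    Assumption: the score is the mean Jaccard overlap with every other
    candidate; the abstract only says "a simple token overlap score".
    """
    token_sets = [tokens(c) for c in candidates]

    def consensus(i: int) -> float:
        others = [jaccard(token_sets[i], t)
                  for j, t in enumerate(token_sets) if j != i]
        return sum(others) / len(others) if others else 0.0

    best = max(range(len(candidates)), key=consensus)
    return candidates[best]


def sample_and_select(prompt: str,
                      sample_sentence: Callable[[str], str],
                      num_samples: int = 5,
                      max_sentences: int = 3) -> str:
    """Build an output sentence by sentence.

    At each step, draw several candidate sentences conditioned on the
    prompt plus the sentences selected so far, then keep the candidate
    with the highest consensus score.
    """
    selected: List[str] = []
    for _ in range(max_sentences):
        prefix = " ".join([prompt] + selected)
        candidates = [sample_sentence(prefix) for _ in range(num_samples)]
        candidates = [c for c in candidates if c]  # drop empty samples
        if not candidates:
            break
        selected.append(select_by_overlap(candidates))
    return " ".join(selected)
```

In practice, `sample_sentence` would wrap a sampling call to a language model and stop at the first sentence boundary. The number of samples and the stopping rule used here are illustrative choices and are not taken from the paper.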