Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (Soft-SC), which replaces SC's discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. Soft-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, Soft-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that Soft-SC can be applied to both open-source and black-box models.
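The contrast between SC's discrete voting and Soft-SC's continuous likelihood-based scoring can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `soft_sc` function assumes per-token log-probabilities are available for each sampled answer and uses the mean token log-probability as the continuous score, which is one plausible aggregation choice.

```python
def majority_vote(answers):
    # Standard self-consistency (SC): select the most frequent sampled answer.
    # When valid answers are sparsely distributed, samples are often all
    # distinct, so the "majority" degenerates to an arbitrary singleton.
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1
    return max(counts, key=counts.get)

def soft_sc(answers, token_logprobs):
    # Soft self-consistency (Soft-SC) sketch: replace the discontinuous vote
    # count with a continuous score derived from model likelihoods. Here each
    # answer is scored by its mean token log-probability (an assumption for
    # illustration), and the highest-scoring answer is selected, even when
    # every sampled answer is unique.
    scores = [sum(lps) / len(lps) for lps in token_logprobs]
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]

# Three distinct sampled bash actions: majority voting cannot break the tie,
# but the continuous score still yields a clear winner.
answers = ["ls -a", "ls --all", "ls -la"]
logprobs = [[-0.1, -0.2], [-1.0, -1.5], [-0.5, -0.9]]
print(soft_sc(answers, logprobs))
```

Because the score is continuous, fewer samples are needed to separate candidates, which is consistent with the abstract's claim that Soft-SC matches SC with roughly half the samples.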