Decoding methods for large language models often trade off diversity of outputs against parallelism of computation. Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others) are embarrassingly parallel, but offer no guarantees against duplicate samples. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model. The framework is compatible with common sampling variations, provides provable beam diversity under certain conditions, is embarrassingly parallel, and yields unbiased and consistent estimates of expectations under the original model. We demonstrate the effectiveness of our approach on WMT machine translation, more than halving the standard deviation when estimating expected BLEU score reward, and closing the BLEU score gap between independent sampling and beam search by up to 63%.
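To make the idea of sampling from an implicitly defined arithmetic code book concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each sample is obtained by decoding one code point in the unit interval, and that the code points form an evenly spaced lattice shifted by a single shared random offset; both are assumptions made for illustration. The names `toy_next_token_probs`, `decode_code_point`, and `arithmetic_sample`, along with the three-token vocabulary, are hypothetical stand-ins for a real language model and its decoding loop.

```python
import random

VOCAB = ["a", "b", "<eos>"]


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a language model's next-token distribution."""
    if len(prefix) >= 3:
        return [0.0, 0.0, 1.0]  # force <eos> so every sequence terminates
    if prefix and prefix[-1] == "a":
        return [0.2, 0.5, 0.3]
    return [0.5, 0.3, 0.2]


def decode_code_point(u, next_token_probs=toy_next_token_probs, max_len=10):
    """Return the token sequence whose arithmetic-code interval contains u."""
    lo, hi = 0.0, 1.0
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        width = hi - lo
        cum = lo
        chosen = None
        # Split [lo, hi) into sub-intervals proportional to token probability
        # and pick the one containing u.
        for idx, p in enumerate(probs):
            if p <= 0.0:
                continue
            upper = cum + p * width
            chosen = (idx, cum, upper)
            if u < upper:
                break
            cum = upper
        # If rounding left u above every boundary, `chosen` holds the last
        # token with positive probability.
        idx, lo, hi = chosen
        if VOCAB[idx] == "<eos>":
            break
        prefix.append(VOCAB[idx])
    return tuple(prefix)


def arithmetic_sample(n, rng):
    """Decode n evenly spaced code points sharing one random offset (assumed scheme)."""
    offset = rng.random()
    # Each code point is decoded independently of the others, so this loop
    # could be distributed across workers without communication.
    return [decode_code_point((i + offset) / n) for i in range(n)]


if __name__ == "__main__":
    for seq in arithmetic_sample(8, random.Random(0)):
        print(" ".join(seq) if seq else "(empty)")
```

Because every code point is decoded without reference to the others, the loop in `arithmetic_sample` parallelizes trivially. In this sketch, two code points yield the same sequence only when they fall inside the same sequence interval, which is the intuition behind a diversity guarantee that holds only under certain conditions.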