Despite remarkable advances in language modeling, current mainstream decoding methods still struggle to generate texts that align with human texts across different aspects. In particular, sampling-based methods produce less repetitive texts that are, however, often incoherent in discourse, while search-based methods maintain topic coherence at the cost of increased repetition. Overall, these methods fall short of achieving holistic alignment across a broad range of aspects. In this work, we frame decoding from a language model as an optimization problem whose goal is to strictly match the expected performance of human texts, as measured simultaneously by multiple metrics of desired aspects. The resulting decoding distribution admits an analytical solution that scales the input language model distribution via a sequence-level energy function defined by these metrics. Most importantly, we prove that this induced distribution is guaranteed to yield lower perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts. To make sampling from this globally normalized distribution tractable, we adopt the Sampling-Importance-Resampling technique. Experiments across various domains and model scales demonstrate the superiority of our method over strong baselines, both in metric alignment with human texts and in human evaluation.
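The Sampling-Importance-Resampling step mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `energy` is a hypothetical placeholder for the metric-defined sequence-level energy function, and the candidates are assumed to have been sampled from the base language model beforehand.

```python
import math
import random

def sample_importance_resample(candidates, energy, k=1, seed=0):
    """Sampling-Importance-Resampling (SIR) sketch for an
    energy-reweighted decoding distribution q(x) ∝ p_lm(x) · exp(-E(x)).

    `candidates` are assumed to be sequences already drawn from the base
    language model p_lm. Because the proposal distribution is p_lm itself,
    the importance weight q(x)/p_lm(x) reduces to exp(-E(x)) up to a
    constant, so the p_lm probabilities cancel and need not be stored.
    """
    rng = random.Random(seed)
    # Unnormalized importance weights from the sequence-level energy.
    weights = [math.exp(-energy(x)) for x in candidates]
    total = sum(weights)
    # Resample k sequences in proportion to their weights; this
    # approximates k draws from the globally normalized distribution.
    return rng.choices(candidates, weights=[w / total for w in weights], k=k)
```

In this sketch the normalizing constant of the global distribution never needs to be computed: resampling in proportion to the unnormalized weights is sufficient, which is what makes the otherwise intractable globally normalized distribution usable in practice.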