When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that are not among the given answer choices. Spreading probability mass across multiple surface forms with identical meaning (such as "bath" and "bathtub") is thought to cause a model's true performance to be underestimated; this is known as the "surface form competition" (SFC) hypothesis. The hypothesis has motivated various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC? Are there direct ways of reducing it, and does doing so improve task performance? We propose a mathematical formalism for SFC that allows us to quantify and bound its impact for the first time. We identify a simple method for reducing it: increasing probability mass on the given answer choices by (a) including them in the prompt and (b) using in-context learning with as little as one example. We show that this method eliminates the impact of SFC in the majority of instances. Our experiments on three diverse datasets and six LMs reveal several additional surprising findings. For example, both normalization and prompting methods for reducing SFC can be ineffective or even detrimental to task performance for some LMs. We conclude with practical insights for effectively prompting LMs for multiple-choice tasks.
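To make the quantities in the abstract concrete, the following is a minimal Python sketch; it is not the paper's formalism. It assumes a toy next-token distribution given as a plain dictionary, and the helper names (`answer_choice_mass`, `renormalize_over_choices`, `prediction_is_robust`) are hypothetical. The robustness check rests on a simple assumption: if the probability mass outside the answer choices is smaller than the gap between the top two choices, then no reassignment of that leftover mass can change the predicted answer.

```python
from typing import Dict, List


def answer_choice_mass(next_token_probs: Dict[str, float], choices: List[str]) -> float:
    """Total probability the model places on the given answer choices."""
    return sum(next_token_probs.get(c, 0.0) for c in choices)


def renormalize_over_choices(next_token_probs: Dict[str, float], choices: List[str]) -> Dict[str, float]:
    """Rescale the choice probabilities so they sum to 1.

    This is one simple normalization, used here for illustration only.
    """
    mass = answer_choice_mass(next_token_probs, choices)
    if mass == 0.0:
        raise ValueError("The model places no probability mass on any answer choice.")
    return {c: next_token_probs.get(c, 0.0) / mass for c in choices}


def prediction_is_robust(next_token_probs: Dict[str, float], choices: List[str]) -> bool:
    """Return True if the mass outside the choices cannot overturn the top choice.

    Assumption (ours): the worst case is that all leftover mass belongs to the
    runner-up choice. Requires at least two choices.
    """
    scores = sorted((next_token_probs.get(c, 0.0) for c in choices), reverse=True)
    leftover = 1.0 - sum(scores)
    return scores[0] - scores[1] > leftover


# Toy distributions (made-up numbers, not experimental results).
CHOICES = ["bathtub", "sink", "kitchen"]
ZERO_SHOT = {"bathtub": 0.30, "bath": 0.25, "sink": 0.20, "kitchen": 0.10, "the": 0.15}
CHOICES_IN_PROMPT = {"bathtub": 0.70, "sink": 0.15, "kitchen": 0.10, "bath": 0.03, "the": 0.02}

if __name__ == "__main__":
    for name, probs in [("zero-shot", ZERO_SHOT), ("choices in prompt", CHOICES_IN_PROMPT)]:
        print(name)
        print("  mass on choices:", round(answer_choice_mass(probs, CHOICES), 3))
        print("  renormalized:", renormalize_over_choices(probs, CHOICES))
        print("  prediction robust to leftover mass:", prediction_is_robust(probs, CHOICES))
```

In the first toy distribution, the synonym "bath" absorbs mass that could in principle belong to any choice, so the prediction is not guaranteed to be stable. In the second, nearly all mass already sits on the listed choices, so the leftover mass cannot change the answer.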