Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs affect model behaviour. In the context of open-text generation tasks, however, such an evaluation is not trivial. For example, when a model is presented with an input text and a perturbed, "contrastive" version of it, standard decoding strategies may not reveal meaningful differences in the next-token predictions. With this motivation in mind, we propose Contrastive Input Decoding (CID): a decoding algorithm that generates text given two inputs, such that the generated text is likely given one input but unlikely given the other. In this way, the contrastive generations can highlight potentially subtle differences in the LM's output for the two inputs in a simple and interpretable manner. We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies and to quantify the effect of different input perturbations.
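To make the idea of "likely given one input but unlikely given the other" concrete, the following is a minimal sketch of a single contrastive decoding step. It is not the exact CID formulation, which the abstract does not spell out. It assumes a simple log-ratio scoring rule, a hypothetical weight `lam`, and hand-written toy next-token distributions in place of real LM outputs; the function names are likewise illustrative.

```python
import math


def contrastive_scores(p_main, p_contrast, lam=1.0, eps=1e-12):
    """Score tokens so that a high score means likely under p_main
    and unlikely under p_contrast.

    Assumed rule (illustrative only): log p_main - lam * log p_contrast.
    With lam = 0 this reduces to ordinary log-probability ranking.
    """
    scores = {}
    for token, p in p_main.items():
        q = p_contrast.get(token, 0.0)
        scores[token] = math.log(p + eps) - lam * math.log(q + eps)
    return scores


def pick_next_token(p_main, p_contrast, lam=1.0):
    """Greedy choice under the contrastive score."""
    scores = contrastive_scores(p_main, p_contrast, lam)
    return max(scores, key=scores.get)


# Toy next-token distributions for an input and its perturbed version
# (made-up numbers, not model outputs).
p_main = {"nice": 0.50, "late": 0.30, "rude": 0.20}
p_contrast = {"nice": 0.55, "late": 0.40, "rude": 0.05}

greedy = max(p_main, key=p_main.get)
contrastive = pick_next_token(p_main, p_contrast, lam=1.0)
print("greedy:", greedy)
print("contrastive:", contrastive)
```

In this toy setting, greedy decoding picks the same top token for both inputs, so the difference between them stays hidden. The contrastive score instead surfaces the token whose probability changes most between the two inputs. In practice the distributions would come from running the same LM on both inputs at every generation step.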