Design settings for in-context learning (ICL), such as the choice and order of the in-context examples, can bias a model toward a particular prediction without reflecting an understanding of the task. While many studies discuss these design choices, few have systematically categorized them or mitigated their impact. In this work, we define a typology of three types of label bias in ICL for text classification: vanilla-label bias, context-label bias, and domain-label bias (which we conceptualize and detect for the first time). Our analysis demonstrates that prior label bias calibration methods fall short of addressing all three types. In particular, domain-label bias restricts LLMs to random-level performance on many tasks regardless of the choice of in-context examples. To mitigate the effect of these biases, we propose a simple bias calibration method that estimates a language model's label bias using random in-domain words from the task corpus. After controlling for this estimated bias when making predictions, our novel domain-context calibration significantly improves the ICL performance of GPT-J and GPT-3 on a wide range of tasks. The gain is substantial on tasks with large domain-label bias (up to 37% in Macro-F1). Furthermore, our results generalize to models with different scales, pretraining methods, and manually designed task instructions, showing the prevalence of label biases in ICL.
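The calibration idea described above (estimate the label bias from random in-domain words, then control for it at prediction time) can be illustrated with a minimal sketch. The abstract does not specify the details, so several things here are assumptions: `label_probs` is a hypothetical callable standing in for a prompted language model that returns one probability per label, the number and length of the random-word inputs are arbitrary choices, and "controlling for" the bias is implemented as dividing by the estimated prior and renormalizing, which is one plausible reading rather than the confirmed procedure.

```python
import random
from typing import Callable, List, Sequence


def estimate_label_bias(
    label_probs: Callable[[str], Sequence[float]],  # hypothetical model wrapper
    corpus: Sequence[str],
    n_samples: int = 20,  # assumed; not given in the abstract
    seed: int = 0,
) -> List[float]:
    """Average the model's label probabilities over random in-domain word bags."""
    rng = random.Random(seed)
    vocab = [word for text in corpus for word in text.split()]
    if not vocab:
        raise ValueError("corpus must contain at least one word")
    lengths = [len(text.split()) for text in corpus]

    totals: List[float] = []
    for _ in range(n_samples):
        # Build a content-free pseudo-input: random corpus words, with a
        # length drawn from the corpus so it resembles a real example.
        length = max(1, rng.choice(lengths))
        pseudo_input = " ".join(rng.choice(vocab) for _ in range(length))
        probs = list(label_probs(pseudo_input))
        totals = probs if not totals else [t + p for t, p in zip(totals, probs)]
    return [t / n_samples for t in totals]


def calibrate(
    probs: Sequence[float], bias: Sequence[float], eps: float = 1e-12
) -> List[float]:
    """Divide each label probability by its estimated bias and renormalize."""
    scores = [p / max(b, eps) for p, b in zip(probs, bias)]
    total = sum(scores)
    return [s / total for s in scores]
```

In use, `estimate_label_bias` would be called once per task and prompt, and `calibrate` applied to every test prediction. A label that the model favors even on random in-domain words is down-weighted, so the calibrated prediction depends more on the actual input than on the prior.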