Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and from various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. To address the unexpected performance degradation that results, calibration methods have been developed to mitigate the effects of these biases and recover LLM performance. In this work, we first conduct a systematic analysis of existing calibration methods, in which we both provide a unified view and reveal their failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional cost. In the few-shot setup, we further extend BC so that it can learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
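To make the idea of controlling the contextual bias from the batched input concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the contextual bias is estimated as the per-class mean of the model's log-probabilities over the batch and is then subtracted from each example's scores; the abstract does not state the formula, so this estimator is an assumption. The function name `batch_calibrate` and the toy probabilities are hypothetical.

```python
import math


def batch_calibrate(log_probs):
    """Subtract a batch-estimated per-class bias from each example's scores.

    log_probs: list of rows, one per example; each row holds the model's
    log-probability for every class under the same prompt context.
    Assumption: the contextual bias is the per-class mean over the batch.
    """
    n = len(log_probs)
    num_classes = len(log_probs[0])
    # Per-class mean log-probability over the batch (assumed bias estimate).
    bias = [sum(row[j] for row in log_probs) / n for j in range(num_classes)]
    # Calibrated score = raw log-probability minus the estimated bias.
    calibrated = [[row[j] - bias[j] for j in range(num_classes)]
                  for row in log_probs]
    # Predict the class with the highest calibrated score.
    preds = [max(range(num_classes), key=lambda j: row[j])
             for row in calibrated]
    return calibrated, preds, bias


# Toy batch (made-up numbers) in which the prompt skews every prediction
# toward class 0, so an uncalibrated argmax would pick class 0 four times.
probs = [[0.90, 0.10], [0.80, 0.20], [0.60, 0.40], [0.55, 0.45]]
log_probs = [[math.log(p) for p in row] for row in probs]

raw_preds = [max(range(2), key=lambda j: row[j]) for row in log_probs]
calibrated, preds, bias = batch_calibrate(log_probs)
print(raw_preds)  # [0, 0, 0, 0]
print(preds)      # [0, 0, 1, 1]
```

Under this assumption no labels or extra model calls are needed, which is consistent with the zero-shot, inference-only description: the only added work is one mean and one subtraction per class over scores that were already computed.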