Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. To mitigate the unexpected performance degradation these biases cause, calibration methods have been developed to counteract their effects while recovering LLM performance. In this work, we first conduct a systematic analysis of existing calibration methods, providing a unified view and revealing their failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
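The core idea of controlling contextual bias from a batched input can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the contextual prior over classes is estimated as the mean of the model's predicted class probabilities across the batch, and that this prior is subtracted from each prediction before taking the argmax. The toy `probs` matrix stands in for real LLM class scores and is purely illustrative.

```python
import numpy as np

def batch_calibrate(probs: np.ndarray) -> np.ndarray:
    """probs: (batch_size, n_classes) predicted class probabilities.

    Returns calibrated scores with the batch-level contextual bias removed.
    """
    # Estimate the contextual prior as the mean probability of each
    # class over the whole batch (no labels needed: zero-shot setup).
    prior = probs.mean(axis=0, keepdims=True)
    # Subtracting the prior centers the scores, removing the shared bias.
    return probs - prior

# Toy example: raw scores are systematically biased toward class 0.
probs = np.array([
    [0.70, 0.30],   # near-ambiguous input, pushed toward class 0 by bias
    [0.90, 0.10],   # confidently class 0
    [0.55, 0.45],   # borderline input, tipped to class 0 by bias
])
raw_preds = probs.argmax(axis=1)                   # all class 0
bc_preds = batch_calibrate(probs).argmax(axis=1)   # borderline cases flip
```

Because calibration happens purely on the output scores of a batch, the procedure adds no model calls beyond ordinary inference, which is consistent with the abstract's claim that BC is inference-only with negligible extra cost.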