Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and from various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. Because these biases cause unexpected performance degradation, calibration methods have been developed to mitigate their effects and recover LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, in which we both provide a unified view and reveal their failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional cost. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
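To make the idea of controlling the contextual bias from the batched input concrete, the following is a minimal sketch rather than the method as specified. The abstract does not state the estimator, so the sketch assumes that the bias for each class is estimated as the mean of that class's log-probability over the unlabeled batch and is then subtracted from every sample's score. The names `batch_calibrate` and `predict`, and the toy scores, are hypothetical.

```python
from typing import List


def batch_calibrate(log_probs: List[List[float]]) -> List[List[float]]:
    """Shift each class score by that class's mean score over the batch.

    log_probs[i][j] is the model's log-probability of class j for sample i
    under a fixed prompt context. No labels are used.
    """
    if not log_probs:
        return []
    batch_size = len(log_probs)
    num_classes = len(log_probs[0])
    # Assumption: the contextual bias of class j is approximated by the
    # average score the model assigns to class j across the whole batch.
    bias = [
        sum(row[j] for row in log_probs) / batch_size
        for j in range(num_classes)
    ]
    return [[row[j] - bias[j] for j in range(num_classes)] for row in log_probs]


def predict(scores: List[List[float]]) -> List[int]:
    """Return the index of the highest-scoring class for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy batch (made-up numbers): the prompt inflates class 0 for every input.
raw = [
    [-0.2, -1.8],
    [-0.3, -1.4],
    [-0.6, -0.9],
    [-0.5, -1.0],
]
calibrated = batch_calibrate(raw)

print(predict(raw))         # every sample is assigned to class 0
print(predict(calibrated))  # the last two samples move to class 1
```

In this toy batch the uncalibrated argmax assigns every sample to class 0, while the mean-shifted scores split the batch between the two classes. The shift needs only the scores already computed at inference time, which is consistent with the claim that the method is zero-shot and adds negligible cost.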