Following the standard supervised fine-tuning (SFT) paradigm, in-context learning (ICL) has become an efficient approach, propelled by recent advances in large language models (LLMs), and yields promising performance across various tasks in few-shot setups. However, both paradigms are prone to overconfidence (i.e., miscalibration), especially in such limited-data setups. In this work, we present an in-depth analysis of how different learning methods behave in terms of both task performance and calibration, as well as the interplay between the two. Through extensive controlled experiments, we find that simultaneous gains in task performance and calibration are difficult to achieve, and that miscalibration persists across all learning methods in low-resource scenarios. To address this trade-off between performance and calibration, we then investigate the potential of self-ensembling techniques applied at different modeling stages (e.g., variations of in-context examples, variations in prompts, or different ensembling strategies). We demonstrate that self-ensembling is feasible for SFT as well as ICL, making predictions better calibrated while achieving comparable or even better performance. Our work sheds light on which learning paradigm to choose and how to enhance both the task performance and the calibration of LLMs.
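To make the two central notions concrete, the following is a minimal sketch, not the procedure used in this work. It assumes that calibration is measured with expected calibration error (ECE), a common metric that the abstract does not name, and that self-ensembling averages the class probabilities produced by the same model under different prompts or in-context examples, which is only one of several possible ensembling strategies. The function names, the number of bins, and the toy probabilities are all hypothetical.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted mean gap between accuracy and confidence.

    probs  : array of shape (n_examples, n_classes), rows sum to 1
    labels : array of shape (n_examples,), integer class ids
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)        # confidence of the predicted class
    predictions = probs.argmax(axis=1)     # predicted class id
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap     # weight by the fraction of examples in the bin
    return float(ece)


def self_ensemble(member_probs):
    """Average class probabilities over ensemble members.

    member_probs : array of shape (n_members, n_examples, n_classes), where
    each member is the same model run with a different prompt or a different
    set of in-context examples (assumed setup, for illustration only).
    """
    return np.asarray(member_probs, dtype=float).mean(axis=0)


# Toy data: two prompt variants, two examples, two classes (made-up numbers).
member_probs = np.array([
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.7, 0.3], [0.4, 0.6]],
])
labels = np.array([0, 1])

ensembled = self_ensemble(member_probs)
ece_single = expected_calibration_error(member_probs[0], labels)
ece_ensembled = expected_calibration_error(ensembled, labels)
```

Averaging probabilities tends to soften extreme confidences when the members disagree, which is one intuition for why ensembling might help calibration. Whether it also preserves accuracy depends on the task and the members, so the toy numbers above should not be read as evidence either way.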