Supervised finetuning (SFT) on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities observed in modern large language models (LLMs). However, the annotation effort required to produce high-quality responses for instructions is becoming prohibitively expensive, especially as the number of tasks spanned by instruction datasets continues to increase. Active learning is effective at identifying useful subsets of samples to annotate from an unlabeled pool, but its high computational cost remains a barrier to its widespread applicability in the context of LLMs. To mitigate the annotation cost of SFT and circumvent the computational bottlenecks of active learning, we propose using experimental design. Experimental design techniques select the most informative samples to label, and typically maximize some notion of uncertainty and/or diversity. In our work, we implement a framework that evaluates several existing and novel experimental design techniques and find that these methods consistently yield significant gains in label efficiency with little computational overhead. On generative tasks, our methods achieve the same generalization performance with only $50\%$ of the annotation cost required by random sampling.
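As a rough illustration of what selecting samples by uncertainty or diversity can look like, the sketch below shows two generic selection rules applied to an unlabeled prompt pool. It is not the framework or any specific method from this work: the function names, the per-prompt `uncertainty` scores, and the prompt `embeddings` are all hypothetical inputs assumed to have been computed beforehand, for example from a pretrained model.

```python
import numpy as np


def select_most_uncertain(uncertainty, budget):
    """Uncertainty-based selection (illustrative): return the indices of the
    `budget` prompts with the highest precomputed uncertainty score."""
    scores = np.asarray(uncertainty, dtype=float)
    order = np.argsort(-scores, kind="stable")  # descending, ties keep pool order
    return order[:budget].tolist()


def k_center_greedy(embeddings, budget, first=0):
    """Diversity-based selection (illustrative): greedy k-center, which
    repeatedly picks the prompt farthest from everything selected so far."""
    X = np.asarray(embeddings, dtype=float)
    budget = min(budget, len(X))
    if budget <= 0:
        return []
    selected = [first]
    # Distance from every prompt to its nearest selected prompt.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected


if __name__ == "__main__":
    # Hypothetical toy pool: 4 prompts with made-up scores and 2-D embeddings.
    print(select_most_uncertain([0.1, 0.9, 0.5, 0.7], budget=2))
    print(k_center_greedy([[0, 0], [0.1, 0], [10, 0], [5, 0]], budget=3))
```

In a setup like this, only the selected prompts would be sent for annotation and then used for finetuning; how the scores and embeddings are obtained is left open here.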