While large language models (LLMs) demonstrate reasonable zero-shot capability across many downstream tasks, fine-tuning is a common practice to improve their performance. However, a task's data efficiency, i.e., the number of fine-tuning examples needed to reach a desired level of performance, is often unknown, resulting in costly cycles of incremental annotation and retraining. Indeed, we demonstrate across a curated set of 30 specialized tasks that performant LLMs may struggle in the zero-shot setting but can attain strong performance after fine-tuning. This motivates methods that predict a task's data efficiency without requiring incremental annotation. After introducing a concrete metric that quantifies a task's data efficiency, we propose using the gradient cosine similarity of low-confidence examples to predict data efficiency from a small number of labeled samples. We validate our approach on a diverse set of tasks with varying data efficiencies, attaining 8.6% error in overall data efficiency prediction and typically eliminating hundreds of unnecessary annotations per task. Our experimental results and implementation code are available on GitHub.
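The core signal in the abstract, gradient cosine similarity among low-confidence examples, can be sketched in a few functions. The following is a minimal illustrative sketch, not the authors' released implementation: it assumes a classification-style PyTorch model, an arbitrary confidence threshold of 0.6, and mean pairwise cosine similarity as the aggregation; the paper's actual predictor and hyperparameters may differ.

```python
# Sketch: compute gradient cosine similarity over low-confidence labeled examples.
# Model interface, threshold, and aggregation are illustrative assumptions,
# not the paper's exact method.
import torch
import torch.nn.functional as F


def per_example_gradients(model, batch, loss_fn):
    """Return one flattened gradient vector per labeled example (x, y)."""
    grads = []
    for x, y in batch:
        model.zero_grad()
        logits = model(x.unsqueeze(0))
        loss = loss_fn(logits, y.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g.detach().clone())
    return grads


def low_confidence_examples(model, batch, threshold=0.6):
    """Keep examples whose top predicted probability falls below `threshold`."""
    kept = []
    with torch.no_grad():
        for x, y in batch:
            probs = F.softmax(model(x.unsqueeze(0)), dim=-1)
            if probs.max().item() < threshold:
                kept.append((x, y))
    return kept


def mean_pairwise_cosine(grads):
    """Average cosine similarity over all pairs of per-example gradients."""
    sims = [F.cosine_similarity(grads[i], grads[j], dim=0).item()
            for i in range(len(grads))
            for j in range(i + 1, len(grads))]
    return sum(sims) / max(len(sims), 1)
```

Under this reading, a higher average similarity suggests that additional labeled examples push the model in a consistent direction, i.e., the task is more data-efficient; how this score is mapped to a predicted annotation budget is left to the paper's calibration procedure.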