Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbors retrieval from large datasets. However, we theoretically show that this approach tends to select redundant data, limiting its effectiveness or even hurting performance. To address this, we introduce SIFT, a data selection algorithm designed to reduce uncertainty about the model's response given a prompt, which unifies ideas from retrieval and active learning. Whereas Nearest Neighbor retrieval typically fails in the presence of information duplication, SIFT accounts for such duplication and optimizes the overall information gain of the selected examples. We focus our evaluations on test-time fine-tuning for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. We provide the $\texttt{activeft}$ (Active Fine-Tuning) library which can be used as a drop-in replacement for Nearest Neighbor retrieval.
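To make the contrast concrete, the following is a minimal toy sketch (not the $\texttt{activeft}$ implementation) of the two selection strategies over candidate embeddings: Nearest Neighbor retrieval ranks candidates purely by similarity to the prompt, while an uncertainty-reducing selector in the spirit of SIFT greedily picks the candidate that most reduces the posterior variance of the prompt's prediction under a simple linear-kernel surrogate model. All function names and the regularization constant `lam` are illustrative assumptions, not the library's API.

```python
import numpy as np

def nearest_neighbors(q, X, k):
    # Rank candidates by cosine similarity to the prompt embedding q.
    sims = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
    return list(np.argsort(-sims)[:k])

def sift_like_select(q, X, k, lam=0.1):
    # Toy greedy uncertainty reduction (illustrative, not activeft's API):
    # at each step, pick the candidate that most reduces the posterior
    # variance of f(q) under a linear kernel k(x, x') = x . x' with
    # observation-noise regularization lam.
    selected = []
    for _ in range(k):
        best, best_var = None, np.inf
        for i in range(len(X)):
            if i in selected:
                continue
            S = X[selected + [i]]
            K = S @ S.T + lam * np.eye(len(S))   # regularized kernel matrix
            kq = S @ q                            # kernel between q and S
            var = q @ q - kq @ np.linalg.solve(K, kq)  # posterior variance at q
            if var < best_var:
                best, best_var = i, var
        selected.append(best)
    return selected
```

On a candidate pool containing two identical embeddings close to the prompt plus one complementary embedding, Nearest Neighbor retrieval returns both duplicates, whereas the greedy variance-minimizing selector takes one duplicate and then switches to the complementary point, since a second copy of already-selected information yields little additional variance reduction.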