Many recent efforts aim to augment language models with relevant information retrieved from a database at test time. We avoid the need for prompt engineering by directly fine-tuning the model on data retrieved at test time using its standard training setup. For this purpose, we build a large-scale distributed nearest neighbor index based on text embeddings of the Pile dataset. Given a query to a language model, our system retrieves the neighbors of the query and fine-tunes the model on the text data corresponding to those neighbors. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than twenty language modeling tasks in the Pile benchmark. For example, test-time training significantly narrows the performance gap between a small GPT2 model and a GPTNeo model, more than ten times larger, that was specifically trained to convergence on the Pile. Sufficient index quality and size, however, are important. Our work establishes a valuable first baseline for implementing test-time training in the context of large language models, opening the door to numerous promising research avenues.