Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have been proposed, their practical effectiveness is still unknown because existing evaluations consider very few models and do not adequately account for overhead costs. We perform an extensive evaluation of ER across eight different models (17 to 900 million parameters) and fourteen tasks in English. We show how a simple ER technique that caches activations from an intermediate layer of a pretrained model, and learns task-specific adapters on the later layers, is broadly effective. For the best-performing baseline in our experiments (DeBERTa-v2 XL), adding a precomputed cache results in a >90% speedup during training and 87-91% speedup for inference, with negligible impact on accuracy. Our analysis reveals important areas of future work.
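As a rough illustration of the caching idea described above, the following toy sketch stores the activations of an intermediate layer keyed by a hash of the input text, then runs only the later layers (plus an optional task-specific adapter) on each subsequent pass. It is not the evaluated implementation: the class `RecyclingEncoder`, the helper `make_layer`, the counter `lower_calls`, the `scale_adapter` stand-in, and the tiny random "layers" are all hypothetical names and simplifications chosen for illustration, and adapter training is omitted.

```python
import hashlib
import math
import random


def make_layer(dim, seed):
    """Toy stand-in for a transformer layer: fixed random linear map + tanh."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

    return layer


class RecyclingEncoder:
    """Hypothetical encoder that caches the output of layer `cache_layer`."""

    def __init__(self, dim=8, n_layers=6, cache_layer=3):
        self.dim = dim
        self.layers = [make_layer(dim, seed=i) for i in range(n_layers)]
        self.cache_layer = cache_layer
        self.cache = {}        # text hash -> intermediate activations
        self.lower_calls = 0   # how often the lower layers were actually run

    def _embed(self, text):
        # Deterministic toy input embedding derived from the text's hash.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255.0 - 0.5 for b in digest[: self.dim]]

    def _lower(self, text):
        # Run the frozen lower layers once per distinct text, then reuse.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.lower_calls += 1
            h = self._embed(text)
            for layer in self.layers[: self.cache_layer]:
                h = layer(h)
            self.cache[key] = h
        return self.cache[key]

    def forward(self, text, adapter=None):
        # Only the later layers run on every call; the adapter is a
        # task-specific function applied after each of them.
        h = self._lower(text)
        for layer in self.layers[self.cache_layer:]:
            h = layer(h)
            if adapter is not None:
                h = adapter(h)
        return h


def scale_adapter(h):
    # Stand-in for a learned adapter; a real one would have trained weights.
    return [0.9 * v for v in h]


encoder = RecyclingEncoder()
task_a = encoder.forward("the same document")                 # fills the cache
task_b = encoder.forward("the same document", scale_adapter)  # reuses it
```

In this sketch the second call skips the first three layers entirely, which is where the savings would come from when several tasks or models share one corpus. The cost of storing and reading the cache is not modeled here.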