Continual learning has emerged as a promising way to refine models incrementally using user feedback, improving model quality in applications such as code completion, personal assistants, and chat interfaces. In particular, online continual learning, which iteratively trains the model on small batches of user feedback, has demonstrated notable performance improvements. However, the common practice of segregating the training and serving processes forces the online trainer to recompute intermediate results that were already produced during serving; such redundant computation can account for 30%-42% of total training time. In this paper, we propose Alchemist, to the best of our knowledge the first online continual learning system that efficiently reuses intermediate results computed during serving to reduce redundant computation, with minimal impact on serving latency or capacity. Alchemist introduces two key techniques: (1) minimal activation recording and saving, which records and saves activations only during the prefill phase to minimize serving overhead; and (2) offloading of serving activations, which dynamically manages GPU memory by freeing activations in forward order and reloading them in backward order during the backward pass. Evaluations on the ShareGPT dataset show that, compared with a separate training cluster, Alchemist increases training throughput by up to 1.72x, reduces training memory usage by up to 47%, and supports up to 2x more training tokens, all with negligible impact on serving latency.
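To make the second technique concrete, the following is a minimal sketch of the free-in-forward-order / reload-in-backward-order pattern, expressed with PyTorch's `torch.autograd.graph.saved_tensors_hooks`. The `ActivationOffloader` name, the toy model, and the use of saved-tensor hooks are illustrative assumptions on our part; the sketch shows generic single-process activation offloading, not Alchemist's actual mechanism for reusing activations recorded during serving.

```python
import torch

class ActivationOffloader:
    # Hypothetical helper illustrating the pattern; not Alchemist's API.
    # A real system would also filter out saved parameters (weights),
    # which these hooks see alongside activations.
    def __init__(self, device):
        self.device = device

    def pack(self, tensor):
        # Invoked in forward order each time autograd saves an activation:
        # copy it to pinned host memory so the GPU copy can be freed.
        if tensor.device.type != "cuda":
            return tensor  # nothing to offload for CPU tensors
        cpu_copy = torch.empty(tensor.size(), dtype=tensor.dtype,
                               pin_memory=True)
        cpu_copy.copy_(tensor, non_blocking=True)
        return cpu_copy

    def unpack(self, packed):
        # Invoked in backward order as autograd consumes each saved
        # activation: reload it onto the GPU just in time.
        return packed.to(self.device, non_blocking=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
offloader = ActivationOffloader(device)
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
).to(device)
x = torch.randn(8, 1024, device=device, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(offloader.pack, offloader.unpack):
    loss = model(x).sum()  # activations stream to host as the forward runs
loss.backward()            # activations stream back as the backward needs them
```

Because autograd saves activations in forward order and consumes them in reverse, this hook pair naturally frees GPU memory early in the forward pass and repopulates it late in the backward pass, which is the memory-management behavior the abstract describes.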