Knowledge distillation (KD) is an essential technique for compressing large language models (LLMs) into smaller ones. However, although the student and teacher models play distinct roles in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP or DeepSpeed) for both, leading to suboptimal training efficiency. In this paper, we present \textbf{KDFlow}, a novel framework for LLM distillation that features a decoupled architecture and employs SGLang for teacher inference. By combining the training efficiency of FSDP2 with the inference efficiency of SGLang, KDFlow exploits the advantages of both in a unified system. Moreover, instead of transferring full logits across processes, our framework transmits only the teacher's hidden states via zero-copy data transfer and recomputes the logits on the student side, effectively balancing communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow achieves a \textbf{1.44$\times$ to 6.36$\times$} speedup over current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
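To make the communication argument concrete, the following is a minimal sketch, not KDFlow's actual implementation. It compares the per-token payload of sending full logits against sending hidden states, and shows that logits can be recovered on the receiving side with one matrix-vector product. The sizes `HIDDEN_SIZE` and `VOCAB_SIZE`, the 2-byte value width, and the helper names are all hypothetical, and the sketch assumes the student-side process holds a copy of the teacher's output projection.

```python
# Illustrative sketch only: sizes and helper names are assumptions,
# not values or APIs taken from KDFlow.

HIDDEN_SIZE = 4096      # hypothetical teacher hidden dimension
VOCAB_SIZE = 128_000    # hypothetical teacher vocabulary size
BYTES_PER_VALUE = 2     # assumes 16-bit values (e.g., bf16)


def bytes_per_token(num_values, bytes_per_value=BYTES_PER_VALUE):
    """Payload size in bytes for one token position."""
    return num_values * bytes_per_value


def recompute_logits(hidden, lm_head):
    """Recover logits from one hidden-state vector.

    hidden:  list of d floats (teacher's final hidden state for one token)
    lm_head: list of V rows, each a list of d floats (output projection)
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]


# Per-token transfer cost under the assumed sizes.
logit_bytes = bytes_per_token(VOCAB_SIZE)
hidden_bytes = bytes_per_token(HIDDEN_SIZE)
ratio = logit_bytes / hidden_bytes

# Toy recomputation with d = 2 and V = 3.
toy_hidden = [1.0, 2.0]
toy_lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_logits = recompute_logits(toy_hidden, toy_lm_head)

print(f"logits: {logit_bytes} B/token, hidden states: {hidden_bytes} B/token")
print(f"reduction factor: {ratio}")
print(f"recomputed toy logits: {toy_logits}")
```

Under these assumed sizes the payload shrinks by the ratio of vocabulary size to hidden size, at the cost of one extra projection on the receiving side. This is the trade-off between communication cost and recomputation that the abstract describes.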