Knowledge distillation (KD) is a technique for compressing a larger system (the teacher) into a smaller one (the student). In machine translation, studies typically report only the translation quality of the student and omit the computational cost of performing KD itself, making it difficult to choose among the many available KD methods under compute constraints. In this study, we evaluate representative KD methods on both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool, which accounts for runtime operational emissions as well as amortized hardware production emissions across the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, so KD becomes beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute constraints.
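As a rough illustration of finding (ii), the break-even point can be read as a fixed distillation overhead amortized against per-request inference savings: KD pays off once the inference volume n satisfies distill + n * student < n * teacher. The sketch below is a minimal rendering of that arithmetic; the function name `breakeven_requests` and all emission figures are hypothetical placeholders, not measurements from the study.

```python
# Minimal sketch of the usage-threshold reasoning in finding (ii).
# KD adds a fixed distillation overhead but lowers per-request
# inference emissions, so it amortizes only past a threshold volume.
# All numbers are illustrative, not taken from the paper.

def breakeven_requests(distill_kgco2e: float,
                       teacher_per_req_kgco2e: float,
                       student_per_req_kgco2e: float) -> float:
    """Smallest inference volume n at which
    distill + n * student < n * teacher,
    i.e. n > distill / (teacher - student)."""
    saving = teacher_per_req_kgco2e - student_per_req_kgco2e
    if saving <= 0:
        return float("inf")  # student no cheaper: KD never amortizes
    return distill_kgco2e / saving

# Hypothetical values (kgCO2e): 50 for the distillation phase,
# 2e-4 per teacher request, 5e-5 per student request.
n = breakeven_requests(50.0, 2e-4, 5e-5)
print(f"KD footprint-beneficial beyond ~{n:,.0f} requests")
```

Under these placeholder values the student only becomes the lower-footprint choice after roughly 333,000 requests, which is the sense in which the threshold is task- and deployment-dependent.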