We study the problem of progressive distillation: given a large, pre-trained teacher model $g$, we seek to decompose the model into an ensemble of smaller student models $f_i$ with low inference cost. The resulting ensemble allows for flexibly trading accuracy against inference cost, which is useful for a number of applications in on-device inference. The method we propose, B-DISTIL, relies on an algorithmic procedure that uses function composition over intermediate activations to construct expressive ensembles whose performance is similar to that of $g$, but with much smaller student models. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We also provide theoretical guarantees for our method in terms of convergence and generalization.
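To make the accuracy versus inference cost trade-off concrete, the following is a minimal toy sketch in pure Python. It is not the B-DISTIL procedure: it assumes a generic greedy residual-fitting scheme, a scalar stand-in for the teacher $g$, and one-threshold step functions as the students $f_i$, and it does not use intermediate activations. All names (`teacher`, `fit_stump`, `build_ensemble`, `predict`) are hypothetical and chosen for illustration only.

```python
# Toy illustration only: a generic residual-fitting ensemble, NOT the B-DISTIL procedure.
# The teacher, the student family, and all function names are assumptions.

def teacher(x):
    """Stand-in for the large model g (assumed here to be a simple scalar function)."""
    return x * x


def fit_stump(xs, residuals):
    """Fit a tiny student: a one-threshold step function minimizing squared error."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lo = sum(left) / len(left) if left else 0.0
        hi = sum(right) / len(right) if right else 0.0
        err = sum((r - (lo if x <= t else hi)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi


def build_ensemble(xs, num_students):
    """Greedily add students, each fit to what the current ensemble still misses."""
    students = []
    residuals = [teacher(x) for x in xs]
    for _ in range(num_students):
        f = fit_stump(xs, residuals)
        students.append(f)
        residuals = [r - f(x) for x, r in zip(xs, residuals)]
    return students


def predict(students, x, k):
    """Evaluate only the first k students: smaller k means lower inference cost."""
    return sum(f(x) for f in students[:k])


def mse(students, xs, k):
    """Mean squared error between the teacher and the first k students."""
    return sum((teacher(x) - predict(students, x, k)) ** 2 for x in xs) / len(xs)


xs = [i / 10 for i in range(-10, 11)]
students = build_ensemble(xs, num_students=8)
errors = [mse(students, xs, k) for k in range(len(students) + 1)]
for k, e in enumerate(errors):
    print(f"students used: {k}  mse vs teacher: {e:.4f}")
```

In this sketch, the argument `k` of `predict` plays the role of the cost knob: evaluating a longer prefix of the ensemble costs more and, on the fitting points, never increases the error with respect to the teacher.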