We study the problem of progressive ensemble distillation: given a large, pretrained teacher model $g$, we seek to decompose it into smaller student models $f_i$ with low inference cost, such that evaluating additional models in the ensemble progressively improves predictions. The resulting ensemble allows accuracy to be traded against inference cost at runtime, which is useful for a number of on-device inference applications. The method we propose, B-DISTIL, relies on an algorithmic procedure that uses function composition over intermediate activations to construct expressive ensembles whose performance is similar to that of $g$ but whose student models are smaller. We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We also provide theoretical guarantees in terms of convergence and generalization.
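To make the runtime trade-off concrete, the following is a minimal sketch of progressive evaluation under an inference budget. It is not the B-DISTIL procedure itself: it assumes the student outputs are combined by simple addition, and it does not model function composition over intermediate activations. The names `progressive_predict`, `toy_students`, and `toy_costs`, as well as the per-student cost values, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A student maps an input to a list of class scores (assumed interface).
Student = Callable[[Sequence[float]], List[float]]


def progressive_predict(
    x: Sequence[float],
    students: Sequence[Student],
    costs: Sequence[float],
    budget: float,
) -> Tuple[List[float], int, float]:
    """Evaluate students in order and sum their scores.

    Stops before a student whose cost would push the total past `budget`.
    The first student is always evaluated so that a prediction exists.
    Returns (summed scores, number of students used, cost spent).
    """
    if not students:
        raise ValueError("at least one student model is required")
    total: List[float] = []
    spent = 0.0
    used = 0
    for student, cost in zip(students, costs):
        if used > 0 and spent + cost > budget:
            break
        out = student(x)
        # Additive combination is an assumption made for this sketch.
        total = list(out) if used == 0 else [a + b for a, b in zip(total, out)]
        spent += cost
        used += 1
    return total, used, spent


# Toy students that return fixed scores for two classes (illustrative only).
toy_students: List[Student] = [
    lambda x: [1.0, 0.0],
    lambda x: [-0.5, 1.0],
    lambda x: [-1.0, 1.0],
]
toy_costs = [1.0, 1.0, 2.0]

if __name__ == "__main__":
    for budget in (1.0, 2.0, 4.0):
        scores, used, spent = progressive_predict([0.0], toy_students, toy_costs, budget)
        print(f"budget={budget}: used {used} students, cost {spent}, scores {scores}")
```

A larger budget lets more students contribute to the summed scores, so the caller can choose at runtime how much inference cost to spend on a given input.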