Training deep networks requires various design decisions regarding, for instance, their architecture, data augmentation, or optimization. In this work, we find that these training variations result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other, independent of overall performance. Given an arbitrary pairing of pretrained models and no external rankings (such as separate test sets, which may be unavailable, e.g., due to data privacy), we investigate whether it is possible to transfer such "complementary" knowledge from one model to another without performance degradation. This task is particularly difficult because the additional knowledge can be contained in stronger, equiperformant, or weaker models. Yet facilitating robust transfer in scenarios agnostic to pretrained model pairings would unlock auxiliary gains and knowledge fusion from any model repository without restrictions on model and problem specifics, including from weaker, lower-performance models. This work therefore provides an initial, in-depth exploration of the viability of such general-purpose knowledge transfer. Across large-scale experiments, we first reveal the shortcomings of standard knowledge distillation techniques, and then propose a much more general extension through data partitioning for successful transfer between nearly all pretrained models, which we show can also be done unsupervised. Finally, we assess both the scalability of model-agnostic knowledge transfer and the impact of fundamental model properties on its success.