Distillation transfers knowledge from a large model trained on broad data to a smaller, more efficient model suitable for deployment. In structured prediction settings, prior knowledge about the task can guide the choice of a target architecture that is algorithmically aligned with the underlying problem. Building on recent learning-theoretic analyses of decision-tree (DT) distillation (Boix-Adsera, 2024), we study when distillation succeeds for combinatorial optimization tasks. We focus on the case where the target model is a graph neural network (GNN) whose architecture is aligned with a dynamic programming (DP) algorithm for the task. Assuming that the source model is sufficiently rich, which we formalize through the linear representation hypothesis (LRH) (Elhage et al., 2022; Park et al., 2024), we show that the distillation problem can be solved efficiently with respect to the complexity parameters of the DP transition function, represented as a DT. Our results provide a rigorous sufficient condition for successful distillation, in the spirit of algorithmic alignment.
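To make the notion of a DP transition function represented as a DT concrete, the following minimal sketch may help. It is an illustration only, not the construction analyzed in this work: the choice of Bellman-Ford shortest paths as the DP, and the names `relax_transition` and `bellman_ford`, are assumptions introduced here. Each synchronous round plays the role of one message-passing layer in an aligned GNN, and the per-edge update is a depth-1 decision tree (one comparison, two leaves).

```python
import math


def relax_transition(current, candidate):
    """Hypothetical DP transition written as a depth-1 decision tree."""
    if candidate < current:  # internal node: a single comparison
        return candidate     # leaf: accept the shorter path through the edge
    return current           # leaf: keep the existing estimate


def bellman_ford(num_nodes, edges, source):
    """Synchronous Bellman-Ford; assumes no negative-weight cycles.

    Each outer iteration corresponds to one message-passing round:
    node v aggregates the messages dist[u] + w over its incoming edges
    by repeatedly applying the transition above.
    """
    dist = [math.inf] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):
        new_dist = list(dist)
        for u, v, w in edges:
            # Messages are computed from the previous round's values.
            new_dist[v] = relax_transition(new_dist[v], dist[u] + w)
        dist = new_dist
    return dist


# Small directed example graph given as (u, v, weight) triples.
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
print(bellman_ford(4, edges, source=0))
```

In this toy case the complexity parameters of the transition (tree depth and number of leaves) are constants; the point is only to show the kind of object whose complexity the efficiency guarantee is stated in.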