Feature-based knowledge distillation aims to transfer intermediate representations from a teacher large language model (LLM) to a student. Existing approaches typically rely on direct feature matching or learned projections, implicitly treating representations as objects with intrinsic meaning. However, the relevance of a representation dimension is determined solely by how it affects the model's output. In this work, we propose a functional perspective on feature-based distillation. We characterize knowledge transfer in terms of the teacher's functional geometry, i.e., how its output depends on internal representations, rather than direct representation alignment. This viewpoint reveals that effective distillation need not preserve full high-dimensional features, but instead should retain the dominant directions of functional contribution, naturally inducing an effective functional dimension for each task. Building on this framework, we introduce Flex-KD, an architecture-agnostic and parameter-free distillation method that transfers the teacher's functional geometry while matching the student's representational capacity. Extensive experiments across language understanding and generation benchmarks demonstrate that Flex-KD consistently outperforms existing distillation approaches, particularly under severe teacher-student dimension mismatch.
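To make the core idea concrete, a minimal numpy sketch of selecting dominant directions of functional contribution is given below. This is not the Flex-KD implementation; it assumes a simple first-order contribution score |h · ∂L/∂h| per dimension (a hypothetical choice) and keeps only the top d_student teacher dimensions as distillation targets:

```python
import numpy as np

def functional_topk(teacher_feats, teacher_grads, d_student):
    """Score each teacher dimension by |h * dL/dh| averaged over tokens
    (a first-order estimate of its effect on the output), then keep the
    top d_student dimensions as distillation targets for the student."""
    contrib = np.abs(teacher_feats * teacher_grads).mean(axis=0)  # shape (D,)
    top = np.argsort(contrib)[::-1][:d_student]                   # dominant dims
    return teacher_feats[:, top], top

# Toy example: 16-dim teacher features where only 3 dims affect the output.
rng = np.random.default_rng(0)
H = rng.normal(size=(8, 16))        # teacher hidden states (8 tokens, D=16)
G = np.zeros((8, 16))
G[:, [2, 5, 11]] = 1.0              # gradient is nonzero only for dims 2, 5, 11
targets, idx = functional_topk(H, G, d_student=3)
print(sorted(idx.tolist()))         # → [2, 5, 11]
```

Dimensions with zero gradient contribute nothing to the output, so they are dropped regardless of their activation magnitude, illustrating why alignment to the full high-dimensional feature is unnecessary.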