We study multi-teacher knowledge distillation (KD) for low-resource abstractive summarization from a reliability-aware perspective. We introduce EWAD (Entropy-Weighted Agreement-Aware Distillation), a token-level mechanism that routes supervision between teacher distillation and gold supervision based on inter-teacher agreement, and CPDP (Capacity-Proportional Divergence Preservation), a geometric constraint on the student's position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit-level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross-lingual pseudo-label KD across ten languages retains 71% to 122% of teacher ROUGE-L at 3.2x compression. A human-validated multi-judge LLM evaluation further reveals calibration bias in single-judge pipelines. Overall, our results show that reliability-aware distillation helps characterize when multi-teacher supervision improves summarization and when data scaling outweighs loss engineering.
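To make the token-level routing idea concrete, the following is a minimal sketch of one way an agreement-gated loss could be written. The abstract does not give the EWAD formulas, so everything here is an assumption made for illustration: teachers are weighted by a softmax over negative entropy, agreement is defined as one minus a normalized generalized Jensen-Shannon divergence, and the function names (`entropy_weights`, `agreement`, `routed_token_loss`) are hypothetical. It should not be read as the method evaluated in the paper.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; q is floored at eps."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)


def entropy_weights(teachers):
    """ASSUMPTION: confident (low-entropy) teachers get larger weights."""
    scores = [math.exp(-entropy(p)) for p in teachers]
    total = sum(scores)
    return [s / total for s in scores]


def mixture(teachers, weights):
    """Weighted average of the teacher distributions."""
    vocab = len(teachers[0])
    return [sum(w * p[i] for w, p in zip(weights, teachers)) for i in range(vocab)]


def agreement(teachers, weights):
    """ASSUMPTION: agreement = 1 - generalized JS divergence / log K, in [0, 1]."""
    k = len(teachers)
    if k < 2:
        return 1.0
    mix = mixture(teachers, weights)
    jsd = entropy(mix) - sum(w * entropy(p) for w, p in zip(weights, teachers))
    return min(1.0, max(0.0, 1.0 - jsd / math.log(k)))


def routed_token_loss(teachers, student, gold_index, eps=1e-12):
    """Blend distillation and gold cross-entropy for one token position.

    High teacher agreement routes supervision to the KD term; low agreement
    routes it to the gold label. Returns (loss, agreement).
    """
    weights = entropy_weights(teachers)
    a = agreement(teachers, weights)
    kd = kl(mixture(teachers, weights), student)
    ce = -math.log(max(student[gold_index], eps))
    return a * kd + (1.0 - a) * ce, a


if __name__ == "__main__":
    student = [0.5, 0.3, 0.2]
    agree = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]
    disagree = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(routed_token_loss(agree, student, gold_index=0))
    print(routed_token_loss(disagree, student, gold_index=0))
```

In this sketch, identical teachers yield an agreement of 1, so the token is supervised purely by the distillation term, while teachers that place all mass on different tokens yield an agreement of 0, so the token falls back to gold cross-entropy. A real implementation would operate on batched logits in a deep learning framework and would likely add a temperature, neither of which is shown here.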