We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization that distills a high-quality dataset and model from a low-quality teacher that itself cannot perform these tasks. Unlike prior work that relies on an extreme-scale teacher model (e.g., GPT3) or a task-specific architecture, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs (e.g., GPT2), where paraphrases occupy a proximal subspace in the LM distribution. By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs. We evaluate our method on multiple benchmarks spanning unconstrained and syntax-controlled paraphrase generation, as well as sentence summarization. Our model with 770M parameters consistently outperforms strong baselines, including models distilled from ChatGPT, and at times even ChatGPT itself. We also find that the dataset distilled from a 1.5B-parameter LM exhibits higher diversity and fidelity than datasets up to 13 times larger.
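To make the core idea concrete, the sketch below illustrates one way the "paraphrastic proximity" hypothesis could be operationalized: sample several continuations of the same context from an off-the-shelf GPT2, then keep pairs of generations that land close together in an embedding space as paraphrase candidates. This is a minimal illustration under stated assumptions, not the paper's actual pipeline; the embedding model, sampling settings, and similarity thresholds are all illustrative choices.

```python
# A minimal sketch of the "paraphrastic proximity" idea: sample multiple
# continuations of one context from GPT2, then keep semantically close but
# non-identical pairs as paraphrase candidates. The embedder, decoding
# parameters, and thresholds below are assumptions for illustration only.
from itertools import combinations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
# Assumed stand-in for the paper's filtering of semantically equivalent pairs.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

context = "The city council announced that"
inputs = tok(context, return_tensors="pt")

# Sample several continuations of the same context; under the hypothesis,
# paraphrases occupy a proximal subspace of this sampling distribution.
with torch.no_grad():
    out = lm.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=30,
        num_return_sequences=8,
        pad_token_id=tok.eos_token_id,
    )
prompt_len = inputs["input_ids"].shape[1]
candidates = [
    tok.decode(seq[prompt_len:], skip_special_tokens=True).strip()
    for seq in out
]

# Keep pairs that are close in embedding space but not near-duplicates:
# a crude proxy for identifying generations drawn from the same subspace.
emb = embedder.encode(candidates, convert_to_tensor=True)
pairs = []
for i, j in combinations(range(len(candidates)), 2):
    sim = util.cos_sim(emb[i], emb[j]).item()
    if 0.85 <= sim < 0.99:  # thresholds are illustrative assumptions
        pairs.append((candidates[i], candidates[j]))

for a, b in pairs:
    print(f"paraphrase candidate pair:\n  {a}\n  {b}")
```

In practice, such candidate pairs would then serve as distilled training data for a small student model, which matches the abstract's claim that a usable dataset can be extracted even when the teacher LM cannot paraphrase on demand.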