Commonsense capabilities of pre-trained language models improve dramatically with scale, leading many to believe that scale is the only winning recipe. But is it? Here, we investigate an alternative that a priori seems impossible: can smaller language models (e.g., GPT-2) win over models that are orders of magnitude larger and better (e.g., GPT-3), if powered with novel commonsense distillation algorithms? The key intellectual challenge is to design a learning algorithm that achieves a competitive level of commonsense acquisition without relying on the benefits of scale. In particular, we study generative models of commonsense knowledge, focusing on the task of generating generics: statements of commonsense facts about everyday concepts, e.g., "birds can fly". We introduce I2D2, a novel commonsense distillation framework that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale teacher model with two innovations: (1) a novel adaptation of NeuroLogic Decoding to enhance the generation quality of weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model's own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-tomic, that is the largest and highest-quality available to date.