Current natural language understanding (NLU) models have continued to scale up in both model size and input context, introducing more hidden and input neurons. Although this improves performance on average, the extra neurons do not yield a consistent improvement for all instances, because some hidden neurons are redundant and the noise mixed into the input neurons tends to distract the model. To avoid this problem, previous work mainly reduces low-utility neurons extrinsically through additional post- or pre-processing, such as network pruning and context selection. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model utilizes its neurons efficiently, then no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on this comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks, based on 5 widely used pretrained language models, and find it particularly effective for models with few parameters or long input.
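The comparison principle described in the abstract can be illustrated with a minimal sketch. It assumes a simple hinge-style ranking penalty, which is only one possible reading of "a ranking loss on top of the task-specific losses"; the function name `comparative_loss`, the `margin` parameter, and the choice to add the penalty to the full model's task loss are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def comparative_loss(
    full_loss: float,
    ablated_losses: Sequence[float],
    margin: float = 0.0,  # hypothetical knob, not from the abstract
) -> float:
    """Illustrative hinge-style ranking loss over task-specific losses.

    full_loss:      task-specific loss of the full model on one instance.
    ablated_losses: task-specific losses of submodels with some neurons
                    ablated (disabled) on the same instance.

    A penalty is added whenever an ablated submodel attains a lower task
    loss than the full model, i.e. whenever the expectation "the full
    model's loss is minimal" is violated.
    """
    penalty = sum(
        max(0.0, full_loss - ablated + margin) for ablated in ablated_losses
    )
    # Assumption: the full model's task loss is kept as part of the objective.
    return full_loss + penalty


# The full model already has the smallest loss: no ranking penalty.
no_violation = comparative_loss(0.5, [0.7, 0.9])

# One ablated submodel (loss 0.5) beats the full model (loss 0.8):
# the difference of about 0.3 is added as a penalty.
one_violation = comparative_loss(0.8, [0.5, 0.9])
```

In a real training setup the losses would be differentiable tensors rather than floats, so that the penalty pushes the full model to make use of the neurons whose removal currently helps.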