Language models pretrained on large text corpora have been shown to encode several types of knowledge simultaneously. Traditionally, only the features from the last layer are used when adapting to new tasks or data. We hypothesize that, when using or finetuning deep pretrained models, intermediate-layer features that may be relevant to the downstream task are buried too deep to be exploited efficiently in terms of the samples or training steps needed. To test this, we propose a new layer fusion method, Depth-Wise Attention (DWAtt), which helps re-surface signals from non-final layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline, with all models kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting, and that DWAtt outperforms Concat on larger data sizes. On CoNLL-03 NER, layer fusion yields F1 gains of 3.68--9.73% at different few-shot sizes. The layer fusion models presented significantly outperform the baseline across training scenarios with different data sizes, architectures, and training constraints.
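To make the idea of concatenation-based layer fusion concrete, the following is a minimal sketch for a single token. The abstract does not specify the exact form of Concat, so the design here (concatenating the per-layer feature vectors and mapping them back to the model dimension with one linear projection) is an assumption, and the names `concat_fusion`, `weight`, and `bias` are hypothetical. DWAtt is not sketched because its formulation is not given in this text.

```python
def concat_fusion(layer_states, weight, bias):
    """Fuse one token's per-layer features by concatenation plus a linear map.

    Assumed (hypothetical) shapes:
      layer_states: list of L vectors, each of length d (one per layer)
      weight:       d rows, each of length L * d
      bias:         vector of length d
    Returns a fused vector of length d.
    """
    # Concatenate all layer vectors into one vector of length L * d.
    concatenated = [x for state in layer_states for x in state]
    # Project back to dimension d with a single linear layer.
    return [
        sum(w * x for w, x in zip(row, concatenated)) + b
        for row, b in zip(weight, bias)
    ]


# Toy example: 3 layers, feature dimension 2.
layer_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# A projection that copies only the last layer reproduces the usual
# "last-layer features only" setup as a special case.
last_layer_only = [
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
fused_last = concat_fusion(layer_states, last_layer_only, [0.0, 0.0])

# A projection that averages the layers mixes in intermediate features.
third = 1.0 / 3.0
layer_average = [
    [third, 0.0, third, 0.0, third, 0.0],
    [0.0, third, 0.0, third, 0.0, third],
]
fused_mean = concat_fusion(layer_states, layer_average, [0.0, 0.0])
```

In this sketch the standard last-layer setup is one particular setting of the projection weights, so a learned projection can in principle start from it and shift weight toward intermediate layers when they help the downstream task.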