Building on the cost-efficient pretraining of Crammed BERT, we improve both its performance and its interpretability by introducing a new pretrained model, Dependency Agreement Crammed BERT (DACBERT), and its two-stage pretraining framework, Dependency Agreement Pretraining. Grounded in linguistic theory, the framework incorporates syntactic and semantic information into the pretraining process. The first stage employs four dedicated submodels to capture representative dependency agreements at the chunk level and convert them into embeddings. The second stage uses these embeddings, together with conventional BERT embeddings, to guide the pretraining of the rest of the model. Evaluated on the GLUE benchmark, DACBERT shows notable improvements across various tasks, surpassing Crammed BERT by 3.13% on RTE and by 2.26% on MRPC. Our method also raises the average GLUE score by 0.83%. Pretraining runs on a single GPU within 24 hours, requiring neither additional computational resources nor a longer pretraining duration than Crammed BERT. Extensive studies further show that our approach improves the interpretability of pretrained language models for natural language understanding tasks.