In recent years, a large number of high-quality pretrained models have emerged, greatly influencing natural language understanding (NLU), natural language generation (NLG), and text representation tasks. Traditionally, these models are pretrained on custom domain corpora and then finetuned for specific tasks, which incurs high GPU and labor costs. Recent trends in language modeling have shifted towards improving performance through scaling, which raises these costs further. We introduce GUR, a pretraining framework that combines language modeling and contrastive learning objectives in a single training step. We select similar text pairs from raw unlabeled documents based on their Longest Common Substring (LCS) and train the model with masked language modeling and unsupervised contrastive learning. Without any labeled training data, the resulting model, GUR, outperforms all other pretrained baselines as a retriever on the recall benchmark in a zero-shot setting. GUR also retains its language modeling ability, as our ablation experiment demonstrates. Our code is available at \url{https://github.com/laohur/GUR}.
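To make the pair-selection idea concrete, the following is a minimal sketch of how text pairs might be scored by their longest common substring. It is not taken from the GUR repository: the function names (`longest_common_substring`, `lcs_score`, `select_pairs`), the normalization by the shorter text's length, and the `min_score` threshold are all assumptions made for illustration, and the abstract does not state how the actual selection is done.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b.

    Standard dynamic programming over suffix-match lengths,
    O(len(a) * len(b)) time and O(len(b)) memory.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def lcs_score(a: str, b: str) -> float:
    """Hypothetical similarity: LCS length divided by the shorter text's length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return len(longest_common_substring(a, b)) / shorter


def select_pairs(sentences, min_score=0.3):
    """Return index pairs whose LCS score reaches min_score (an assumed threshold)."""
    pairs = []
    for i, j in combinations(range(len(sentences)), 2):
        if lcs_score(sentences[i], sentences[j]) >= min_score:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "a cat sat on the sofa",
        "completely different",
    ]
    print(select_pairs(docs))
```

Pairs selected this way could then serve as positives for the contrastive objective, while the same batch is also used for masked language modeling. The quadratic all-pairs loop is only practical for small collections; a real corpus would need some form of candidate filtering first.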