Most language model pre-training frameworks concatenate multiple documents into fixed-length sequences and use causal masking to compute the likelihood of each token given its context; this strategy is widely adopted for its simplicity and efficiency. However, the influence of the pre-training sequence composition strategy on the generalisation properties of the model remains under-explored. In this work, we find that applying causal masking across concatenated documents can include distracting information from previous documents during pre-training, which negatively impacts model performance on language modelling and downstream tasks. In intra-document causal masking, the likelihood of each token is conditioned only on the previous tokens in the same document, which eliminates potential distracting information from previous documents and significantly improves performance. Furthermore, we find that concatenating related documents can reduce some potential distractions during pre-training, and our proposed efficient retrieval-based sequence construction method, BM25Chunk, can improve the in-context learning (+11.6%), knowledge memorisation (+9.8%), and context utilisation (+7.2%) abilities of language models without sacrificing efficiency.
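The difference between plain causal masking and intra-document causal masking can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`document_ids`, `causal_mask`, `intra_document_causal_mask`), the use of an end-of-sequence token `eos_id` as the document separator, and the choice to count that separator as part of the document it ends are all assumptions made for illustration.

```python
from typing import List


def document_ids(tokens: List[int], eos_id: int) -> List[int]:
    """Label each position with the index of the document it belongs to.

    Assumption: every document ends with `eos_id`, and the separator is
    counted as part of the document it terminates.
    """
    ids: List[int] = []
    current = 0
    for tok in tokens:
        ids.append(current)
        if tok == eos_id:
            current += 1  # the next token starts a new document
    return ids


def causal_mask(n: int) -> List[List[bool]]:
    """Standard causal mask: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def intra_document_causal_mask(tokens: List[int], eos_id: int) -> List[List[bool]]:
    """Causal mask that additionally blocks attention across document boundaries."""
    ids = document_ids(tokens, eos_id)
    n = len(tokens)
    return [[j <= i and ids[i] == ids[j] for j in range(n)] for i in range(n)]


# Two documents packed into one sequence: [5, 6, EOS] and [7, 8], with EOS = 0.
packed = [5, 6, 0, 7, 8]
plain = causal_mask(len(packed))
intra = intra_document_causal_mask(packed, eos_id=0)
```

In this toy sequence, `plain[3]` lets the first token of the second document attend to all three tokens of the first document, whereas `intra[3]` allows it to attend only to itself. In practice such a mask would be passed to the attention layer rather than built as nested Python lists.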