Incorporating metadata into Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that others, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with a masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
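As a rough illustration of the difference between prepending and appending metadata, the sketch below builds a token sequence and a per-token loss mask for each placement. It is a minimal sketch under stated assumptions, not the procedure used in this study: the whitespace tokenizer, the `<sep>` separator, the function names, and the choice to exclude prepended metadata from the loss are all hypothetical and are not specified in the abstract.

```python
from typing import List, Tuple

SEP = "<sep>"  # hypothetical separator token (assumption, not from the abstract)


def tokenize(text: str) -> List[str]:
    # Toy whitespace tokenizer; a real setup would use a subword tokenizer.
    return text.split()


def prepend_metadata(doc: str, meta: str) -> Tuple[List[str], List[int]]:
    """Metadata as conditioning context, placed before the document.

    Assumption: metadata and separator positions are excluded from the
    loss (mask 0), so the model is only trained to predict document tokens.
    """
    meta_toks, doc_toks = tokenize(meta), tokenize(doc)
    tokens = meta_toks + [SEP] + doc_toks
    loss_mask = [0] * (len(meta_toks) + 1) + [1] * len(doc_toks)
    return tokens, loss_mask


def append_metadata(doc: str, meta: str) -> Tuple[List[str], List[int]]:
    """Metadata as an auxiliary prediction target, placed after the document.

    Every position contributes to the loss (mask 1), so the model must
    predict the metadata from the document it has just read.
    """
    meta_toks, doc_toks = tokenize(meta), tokenize(doc)
    tokens = doc_toks + [SEP] + meta_toks
    loss_mask = [1] * len(tokens)
    return tokens, loss_mask


if __name__ == "__main__":
    print(prepend_metadata("the cat sat", "quality: high"))
    print(append_metadata("the cat sat", "quality: high"))
```

In a training loop, a mask of this kind would typically multiply the per-token cross-entropy before averaging, so that masked positions contribute no gradient.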