Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Incorporating metadata into Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that others, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with a masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.
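
To make the distinction between prepending and appending concrete, the following is a minimal sketch of how a training sequence and its loss mask might be assembled in each case. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_example`, the toy token ids, and the choice to exclude prepended metadata tokens from the loss are all hypothetical. The abstract itself only states that metadata is prepended as context, or appended and predicted as an auxiliary task.

```python
from typing import List, Tuple


def build_example(
    doc_tokens: List[int], meta_tokens: List[int], mode: str
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, loss_mask) for one document.

    loss_mask[i] == 1 means token i contributes to the next-token loss.
    Hypothetical helper for illustration only.
    """
    if mode == "prepend":
        # Metadata comes first and acts as conditioning context.
        # Assumption: its tokens are masked out of the loss.
        input_ids = meta_tokens + doc_tokens
        loss_mask = [0] * len(meta_tokens) + [1] * len(doc_tokens)
    elif mode == "append":
        # Metadata comes last, so the model must predict it from the
        # document: the auxiliary task described in the abstract.
        input_ids = doc_tokens + meta_tokens
        loss_mask = [1] * len(input_ids)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return input_ids, loss_mask


# Toy ids: three document tokens and two metadata tokens
# (for example, a tokenized quality indicator).
doc = [11, 12, 13]
meta = [901, 902]
prepend_ids, prepend_mask = build_example(doc, meta, "prepend")
append_ids, append_mask = build_example(doc, meta, "append")
```

In a real training loop the mask would typically be applied to the per-token cross-entropy before averaging. Whether prepended metadata tokens are masked is a design choice that the abstract does not specify.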

