Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach to localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering four continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to that of region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news multiple-choice questions (MCQs) and show that, after instruction tuning, metadata-conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach to localizing language models.
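To make the setup concrete, the sketch below shows one plausible way metadata conditioning could be applied during pre-training: each document's text is prefixed with its geographic metadata before tokenization. The tag format (`[URL]`, `[COUNTRY]`, `[CONTINENT]`), the field names, and the occasional-dropout probability are illustrative assumptions, not the paper's exact specification.

```python
import random

def condition_on_metadata(doc: dict, drop_prob: float = 0.1) -> str:
    """Prepend geographic metadata tags to a document's text.

    Assumes `doc` carries 'url', 'country', 'continent', and 'text' fields.
    Dropping the prefix a small fraction of the time (drop_prob) is a common
    trick to keep the model usable when no metadata is supplied at inference.
    """
    if random.random() < drop_prob:
        return doc["text"]  # train on bare text part of the time
    prefix = (
        f"[URL] {doc['url']} "
        f"[COUNTRY] {doc['country']} "
        f"[CONTINENT] {doc['continent']}\n"
    )
    return prefix + doc["text"]

# Usage with a hypothetical annotated news record:
record = {
    "url": "https://example-news.ng/politics/article-123",
    "country": "Nigeria",
    "continent": "Africa",
    "text": "Lagos state officials announced on Tuesday ...",
}
print(condition_on_metadata(record))
```

Prefixing (rather than, say, appending) lets the conditioning tokens shape the entire document's representation under a causal attention mask, which is the standard choice in metadata-conditioning setups.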