Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional tagging framework across pretraining, fine-tuning, alignment, and reasoning datasets, we show explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Multilinguality enhances geographic diversity of cultural knowledge but does not ensure balanced representation. Our tags improve downstream cultural benchmark performance, demonstrating that advances require shifting focus in training data pipelines. To facilitate future research, we release our culturally tagged dataset with 5.6M samples at https://huggingface.co/datasets/CohereLabs/CultureMarkers.
翻译:当前文化对齐方法主要聚焦于推理阶段的干预措施,其隐含假设是模型已具备足够的文化知识。我们论证现代大语言模型流程存在文化数据漏斗问题。通过构建一个涵盖预训练、微调、对齐和推理数据集的多维标签框架,我们揭示了以下现象:在模型后训练阶段,显性文化信号急剧衰减,而地理集中、任务专业化的数据占据主导地位。多语言能力虽能增强文化知识的地理多样性,却无法确保表征的平衡性。我们提出的标签体系有效提升了下游文化基准测试性能,表明实质性进展需要将关注重心转向训练数据流程。为促进后续研究,我们在https://huggingface.co/datasets/CohereLabs/CultureMarkers 开源了包含560万个样本的文化标注数据集。