LLMSurgeon: Diagnosing Data Mixture of Large Language Models

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
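The inverse-problem view in the abstract can be illustrated with a small sketch. Under label shift, the average classifier output on generated text, q, is approximately C·π, where column j of the soft confusion matrix C holds the mean classifier probabilities for texts whose true domain is j, and π is the unknown mixture. The code below is not the authors' implementation: the function names, the least-squares objective, and the projected-gradient solver are all assumptions chosen for illustration, since the abstract does not specify how the constrained problem is solved.

```python
# Illustrative sketch only: names, objective, and solver are assumptions,
# not the LLMSurgeon implementation.

def soft_confusion(probs, labels, k):
    """Column j = mean classifier probability vector over texts of true domain j."""
    sums = [[0.0] * k for _ in range(k)]
    counts = [0] * k
    for p, y in zip(probs, labels):
        counts[y] += 1
        for i in range(k):
            sums[i][y] += p[i]
    return [[sums[i][j] / max(counts[j], 1) for j in range(k)] for i in range(k)]

def project_to_simplex(v):
    """Euclidean projection onto {x : x >= 0, sum(x) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        running += ui
        t = (running - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def estimate_mixture(C, q, steps=2000):
    """Minimise 0.5 * ||C @ pi - q||^2 over the simplex by projected gradient descent."""
    k = len(q)
    lr = 1.0 / k  # safe step: columns of C sum to 1, so ||C^T C|| <= k
    pi = [1.0 / k] * k
    for _ in range(steps):
        residual = [sum(C[i][j] * pi[j] for j in range(k)) - q[i] for i in range(k)]
        grad = [sum(C[i][j] * residual[i] for i in range(k)) for j in range(k)]
        pi = project_to_simplex([pi[j] - lr * grad[j] for j in range(k)])
    return pi

# Toy example with three hypothetical domains.
C = [[0.8, 0.1, 0.1],
     [0.1, 0.7, 0.2],
     [0.1, 0.2, 0.7]]
q = [0.45, 0.30, 0.25]  # naive aggregate of classifier outputs
pi_hat = estimate_mixture(C, q)
print([round(x, 3) for x in pi_hat])
```

In this toy case the naive aggregate q overstates the third domain (0.25) because the classifier confuses it with the second; inverting through C recovers a mixture close to (0.5, 0.3, 0.2), which is the kind of correction of systematic domain confusion the abstract describes.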



ACL 2026 | LLMSurgeon: Diagnosing LLM Training Data from Generated Text
