Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and leaves failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, so developers can identify precisely where undesirable behaviors were learned. Building on this capability, DebugLM further supports test-time remediation: developers can selectively trigger refusals for specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.
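To make the tag-then-remediate idea concrete, here is a minimal sketch rather than the DebugLM implementation. The abstract does not say how provenance tags are encoded or how refusal is triggered, so everything here is assumed for illustration: the `[SRC:<dataset>]` prefix format, the names `split_provenance` and `remediate`, the refusal string, and the choice to apply remediation as a filter over the generated text.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical tag format: the abstract does not specify how tags are encoded.
# We assume the model prefixes each response with "[SRC:<dataset_name>]".
TAG_PATTERN = re.compile(r"^\[SRC:(?P<tag>[A-Za-z0-9_\-]+)\]\s*")

# Hypothetical refusal message used when a blocked source is detected.
REFUSAL = "I can't help with that."


def split_provenance(output: str) -> Tuple[Optional[str], str]:
    """Separate a leading provenance tag (if any) from the response text."""
    match = TAG_PATTERN.match(output)
    if match is None:
        return None, output
    return match.group("tag"), output[match.end():]


def remediate(output: str, blocked_sources: Set[str]) -> str:
    """Return a refusal if the response is attributed to a blocked source.

    No model parameters are touched; only the generated text is inspected.
    """
    tag, text = split_provenance(output)
    if tag is not None and tag in blocked_sources:
        return REFUSAL
    return text
```

Under these assumptions, a developer who traces an undesirable behavior to a dataset tagged `forum_dump` could call `remediate(model_output, {"forum_dump"})` at inference time, while responses attributed to other sources pass through unchanged.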