We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models back to the responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs, then ranking training datapoints by cosine similarity, we identify the datapoints that cause specific behaviors, and we validate these attributions causally by retraining with modified data. Clustering the behavior-datapoint similarity matrix also enables unsupervised discovery of emergent behaviors. Applying the method to OLMo 2's production DPO training, we surfaced distractor-triggered compliance, a harmful behavior in which the model complies with dangerous requests when benign formatting instructions are appended. Filtering the top-ranked datapoints reduces this behavior by 63%, while flipping their preference labels achieves a 78% reduction. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism, which emerged from contaminated preference data rather than deliberate injection, provides a realistic benchmark for safety techniques.
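The core ranking step can be sketched as follows; this is a minimal illustration under assumed shapes (one activation-difference vector per test behavior and per preference pair), not the authors' implementation, and all names are hypothetical.

```python
import numpy as np

def rank_datapoints(behavior_diff, train_diffs):
    """Rank training datapoints by cosine similarity between a test prompt's
    activation-difference vector and each preference pair's activation-difference
    vector; higher similarity suggests the pair is responsible for the behavior."""
    b = behavior_diff / np.linalg.norm(behavior_diff)
    t = train_diffs / np.linalg.norm(train_diffs, axis=1, keepdims=True)
    sims = t @ b                 # cosine similarity per training datapoint
    order = np.argsort(-sims)    # datapoint indices, most similar first
    return order, sims

# Toy usage: 4 preference pairs with 2-dim activation differences.
train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
test = np.array([0.9, 0.1])
order, sims = rank_datapoints(test, train)
# order → [0, 2, 1, 3]: pair 0 is the top attribution candidate.
```

Top-ranked indices would then feed the causal validation step: retrain with those datapoints filtered out, or with their preference labels flipped, and measure the change in the target behavior.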