Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
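The core mechanism described above — summarizing candidate data by a mean hidden representation and injecting it into a frozen model's forward pass — can be sketched as follows. This is a minimal illustration under assumptions: the toy model, the injection layer, and the scaling factor `alpha` are all hypothetical stand-ins; the paper's actual choice of layer, scaling, and injection site is not specified here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyLM(nn.Module):
    """Hypothetical stand-in for a frozen base model: embedding + two layers."""
    def __init__(self, vocab=100, d=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids):
        h = self.emb(ids)
        h = torch.tanh(self.layer1(h))
        h = torch.tanh(self.layer2(h))
        return self.head(h)

model = TinyLM().eval()  # frozen: no parameters are updated anywhere

# 1) Summarize the candidate fine-tuning data by its mean hidden representation.
candidate_ids = torch.randint(0, 100, (8, 12))  # 8 candidate examples, 12 tokens each
with torch.no_grad():
    h = torch.tanh(model.layer1(model.emb(candidate_ids)))
    data_feature = h.mean(dim=(0, 1))  # mean over examples and token positions

# 2) Inject the summary into the forward pass via a hook (alpha is an assumed scale).
alpha = 0.5
hook = model.layer1.register_forward_hook(lambda mod, inp, out: out + alpha * data_feature)

probe = torch.randint(0, 100, (1, 6))  # a probe prompt
with torch.no_grad():
    injected_logits = model(probe)
hook.remove()
with torch.no_grad():
    base_logits = model(probe)

# The divergence between injected and baseline outputs is the signal one would
# inspect for shifted (potentially biased or unsafe) behavior before training.
shift = (injected_logits - base_logits).abs().mean().item()
print(f"mean logit shift under injection: {shift:.4f}")
```

Because only a forward hook is added and removed, the model's weights are never touched, which is what keeps this cheap relative to an actual fine-tuning run.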