Recent work has shown that fine-tuning large language models (LLMs) on insecure code or culturally loaded numeric codes can induce emergent misalignment, causing models to produce harmful content on unrelated downstream tasks. The authors of that work concluded that $k$-shot prompting alone does not induce this effect. We revisit this conclusion and show that inference-time semantic drift is real and measurable, but only in sufficiently capable models. In a controlled experiment, we inject five culturally loaded numbers as few-shot demonstrations before a semantically unrelated prompt. Models with richer cultural-associative representations exhibit significant distributional shifts toward darker, authoritarian, and stigmatized themes, while a simpler, smaller model does not. We additionally find that structurally inert demonstrations (nonsense strings) perturb output distributions, suggesting two separable mechanisms: structural format contamination and semantic content contamination. Our results map the boundary conditions under which inference-time contamination occurs and carry direct implications for the security of LLM-based applications that use few-shot prompting.
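The experimental design summarized above can be illustrated with a minimal sketch. It makes several assumptions not stated in the abstract: the `Example:` prompt layout, the placeholder demonstrations, the theme categories, and the choice of Jensen-Shannon divergence as the shift measure are all hypothetical. The labels are hand-written toy data, not results.

```python
import math
from collections import Counter


def build_k_shot_prompt(demonstrations, target_prompt):
    """Place each demonstration on its own line before the unrelated target prompt.

    The "Example:" prefix is an assumed layout, not the paper's format.
    """
    lines = [f"Example: {d}" for d in demonstrations]
    lines.append(target_prompt)
    return "\n".join(lines)


def label_distribution(labels, categories):
    """Turn a list of theme labels into a probability vector over `categories`."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical placeholders; the actual numbers are not listed in the abstract.
semantic_demos = ["<NUM_1>", "<NUM_2>", "<NUM_3>", "<NUM_4>", "<NUM_5>"]
prompt = build_k_shot_prompt(semantic_demos, "Write a short story about a garden.")

# Toy theme labels for sampled completions (illustrative only, not measured data).
categories = ["neutral", "dark", "authoritarian"]
baseline = label_distribution(["neutral"] * 9 + ["dark"], categories)
treated = label_distribution(["neutral"] * 6 + ["dark"] * 3 + ["authoritarian"], categories)
shift = js_divergence(baseline, treated)
```

Running the same comparison with nonsense-string demonstrations in place of `semantic_demos` would give a rough way to separate the structural effect from the semantic one, under the same assumptions.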