Large language models (LLMs) can exhibit concept-conditioned semantic divergence: common high-level cues (e.g., ideologies, public figures) elicit unusually uniform, stance-like responses that evade token-trigger audits. This behavior falls in a blind spot of current safety evaluations, yet carries major societal stakes, as such concept cues can steer content exposure at scale. We formalize this phenomenon and present RAVEN (Response Anomaly Vigilance), a black-box audit that flags cases where a model is simultaneously highly certain and atypical among peers by coupling semantic entropy over paraphrastic samples with cross-model disagreement. In a controlled LoRA fine-tuning study, we implant a concept-conditioned stance using a small biased corpus, demonstrating feasibility without rare token triggers. Auditing five LLM families across twelve sensitive topics (360 prompts per model) and clustering via bidirectional entailment, RAVEN surfaces recurrent, model-specific divergences in 9/12 topics. Concept-level audits complement token-level defenses and provide a practical early-warning signal for release evaluation and post-deployment monitoring against propaganda-like influence.
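The core signal described above — a model that is simultaneously highly certain (low semantic entropy over paraphrase samples) and atypical among peers (low agreement with other models) — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cluster labels are assumed to come from a separate bidirectional-entailment grouping step (e.g., with an NLI model), and the threshold values and the `peer_agreement` score are hypothetical placeholders.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Entropy over semantic clusters of sampled paraphrase responses.

    cluster_ids: one cluster label per response; clusters are assumed to
    have been formed by bidirectional entailment (responses that mutually
    entail each other share a label). Low entropy = stance-like uniformity.
    """
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def raven_flag(own_entropy, peer_agreement,
               entropy_thresh=0.5, agree_thresh=0.5):
    """Flag a (model, concept) pair as anomalous when the model is
    highly certain (low semantic entropy) yet diverges from the
    peer-model consensus (low agreement). Thresholds are illustrative."""
    return own_entropy < entropy_thresh and peer_agreement < agree_thresh

# Hypothetical audit of one concept prompt: 10 paraphrase samples all
# collapse into a single semantic cluster (maximal certainty), while the
# model matches the peer consensus on only 20% of prompts.
entropy = semantic_entropy([0] * 10)   # collapsed distribution, entropy near 0
flagged = raven_flag(entropy, peer_agreement=0.2)
print(entropy, flagged)
```

A model answering uniformly *and* in line with its peers would not be flagged; it is the conjunction of certainty and cross-model divergence that marks a concept-conditioned anomaly.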