Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet little is known about whether toxic language in otherwise semantically equivalent prompts degrades factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations spanning polite and random conditions and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes, while core reasoning nodes remain comparatively stable. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.