Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded in such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct these inputs by combining neutral and harmful sentences and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end). This design enables a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal four consistent patterns: sensitivity is non-monotonic in the proportion of harmful sentences, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is identified more reliably than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long inputs under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.
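The input construction described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `build_long_input`, the use of a sentence count as a stand-in for token length, sampling with replacement, and the placement of harmful sentences as one contiguous block are all assumptions made for illustration. Harm realization (explicit vs. implicit) would be controlled by which pool of harmful sentences is passed in.

```python
import random


def build_long_input(neutral, harmful, n_sentences, harmful_ratio,
                     position, seed=0):
    """Assemble a labeled long input (hypothetical helper, not from the paper).

    neutral, harmful: pools of sentences to sample from (with replacement).
    n_sentences: total number of sentences, used here as a proxy for length.
    harmful_ratio: target proportion of harmful sentences, e.g. 0.01 to 0.50.
    position: "beginning", "middle", or "end".
    Returns a list of (sentence, is_harmful) pairs in document order.
    """
    if position not in {"beginning", "middle", "end"}:
        raise ValueError(f"unknown position: {position!r}")

    rng = random.Random(seed)  # seeded so the same condition is reproducible

    # Assumption: keep at least one harmful sentence even at very low ratios.
    n_harmful = max(1, round(n_sentences * harmful_ratio))
    n_neutral = n_sentences - n_harmful

    harmful_part = [(rng.choice(harmful), True) for _ in range(n_harmful)]
    neutral_part = [(rng.choice(neutral), False) for _ in range(n_neutral)]

    # Assumption: harmful sentences form one contiguous block.
    if position == "beginning":
        return harmful_part + neutral_part
    if position == "end":
        return neutral_part + harmful_part
    half = n_neutral // 2
    return neutral_part[:half] + harmful_part + neutral_part[half:]


def render(labeled_sentences):
    """Join the sentences into the text that would be sent to a model."""
    return " ".join(sentence for sentence, _ in labeled_sentences)
```

Keeping the per-sentence labels alongside the text makes it possible to score a model's extracted sentences against the known harmful ones for every combination of length, proportion, and position.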