Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

Traditional benchmarks for large language models (LLMs) primarily assess safety risk through breadth-oriented evaluation across diverse tasks. However, real-world deployment exposes a different class of risk: operational failures arising from repeated inference on identical or near-identical prompts rather than broad task generalization. In high-stakes settings, response consistency and safety under sustained use are critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (e.g., decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST models them as stochastic outcomes of independent inference events. We formalize safety failures using Bernoulli and binomial models to estimate per-inference failure probabilities, enabling quantitative comparison of reliability across models and decoding configurations. Applying APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH-derived safety prompts, we find that models with similar benchmark-aligned scores can exhibit substantially different empirical failure rates under repeated sampling, particularly as temperature increases. These results demonstrate that shallow, single-sample evaluation can obscure meaningful reliability differences under sustained use. APST complements existing benchmarks by providing a practical framework for evaluating LLM safety and reliability under repeated inference, bridging benchmark alignment and deployment-oriented risk assessment.
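As an illustration of the Bernoulli and binomial formulation described above, the following minimal sketch estimates a per-inference failure probability from repeated samples of one prompt and extrapolates it to sustained use. It is not the authors' implementation. The function names, the choice of a Wilson score interval, and the example counts (3 failures in 200 samples) are assumptions made here for illustration, not values or methods reported in the paper.

```python
import math


def failure_rate(failures: int, trials: int) -> float:
    """Maximum-likelihood estimate of the per-inference failure probability.

    Treats each inference as an independent Bernoulli trial, so the number
    of failures in `trials` samples is binomially distributed.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must lie between 0 and trials")
    return failures / trials


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%).

    The interval type is an assumption of this sketch; the abstract does not
    say which interval, if any, the authors use.
    """
    p_hat = failure_rate(failures, trials)
    denom = 1.0 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def prob_at_least_one_failure(p: float, n: int) -> float:
    """Probability of one or more failures in n independent inferences."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical counts for one prompt at one temperature (illustrative only).
    failures, trials = 3, 200
    p_hat = failure_rate(failures, trials)
    low, high = wilson_interval(failures, trials)
    print(f"estimated failure probability: {p_hat:.4f}")
    print(f"approx. 95% Wilson interval: [{low:.4f}, {high:.4f}]")
    print(f"P(at least one failure in 1000 calls): {prob_at_least_one_failure(p_hat, 1000):.6f}")
```

Under the independence assumption, even a small per-inference failure probability compounds quickly over many calls, which is why a single-sample evaluation can look safe while sustained use is not.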
