Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present a large-scale survival analysis of conversational robustness, modeling failure as a time-to-event process over 36,951 turns from 9 state-of-the-art LLMs on the MT-Consistency benchmark. Our framework combines Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest models with simple semantic drift features. We find that abrupt prompt-to-prompt semantic drift sharply increases the hazard of inconsistency, whereas cumulative drift is counterintuitively protective, suggesting adaptation in conversations that survive multiple shifts. AFT models with model-drift interactions achieve the best combination of discrimination and calibration, and proportional hazards checks reveal systematic violations for key drift covariates, explaining the limitations of Cox-style modeling in this setting. Finally, we show that a lightweight AFT model can be turned into a turn-level risk monitor that flags most failing conversations several turns before the first inconsistent answer while keeping false alerts modest. These results establish survival analysis as a powerful paradigm for evaluating multi-turn robustness and for designing practical safeguards for conversational AI systems.
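
The abstract distinguishes prompt-to-prompt drift from cumulative drift but does not define either feature. As a minimal sketch, assuming each prompt is already represented by an embedding vector, that drift is measured by cosine distance between consecutive prompts, and that cumulative drift is the running sum of those distances, the two covariates could be computed as below. The function names `cosine_distance` and `drift_features` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def drift_features(embeddings):
    """Return one (prompt_to_prompt, cumulative) pair per turn.

    Assumption: the first turn has no predecessor, so both of its
    drift values are set to 0.0.
    """
    if not embeddings:
        return []
    features = [(0.0, 0.0)]
    cumulative = 0.0
    for prev, curr in zip(embeddings, embeddings[1:]):
        step = cosine_distance(prev, curr)  # abrupt, turn-to-turn drift
        cumulative += step                  # drift accumulated so far
        features.append((step, cumulative))
    return features


# Toy, hand-made "embeddings" for a three-turn conversation.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_features = drift_features(toy_embeddings)
```

Per-turn pairs like these could then be supplied as time-varying covariates to a survival model. The fitting step is omitted here because the abstract does not specify the model configuration.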
