As LLMs gain persuasive agentic capabilities through extended dialogues, they introduce novel risks in multi-turn conversational scams that single-turn safety evaluations fail to capture. We systematically study these risks using a controlled LLM-to-LLM simulation framework across multi-turn scam scenarios. Evaluating eight state-of-the-art models in English and Chinese, we analyze dialogue outcomes and qualitatively annotate attacker strategies, defensive responses, and failure modes. Results reveal that scam interactions follow recurrent escalation patterns, while defenses employ verification and delay mechanisms. Furthermore, interactional failures frequently stem from safety guardrail activation and role instability. Our findings highlight multi-turn interactional safety as a critical, distinct dimension of LLM behavior.
翻译:随着LLM通过扩展对话获得说服性代理能力,它们引入了单轮安全评估无法捕捉的多轮对话诈骗新风险。我们使用受控的LLM到LLM仿真框架,在多轮诈骗场景中系统研究这些风险。通过评估八个最先进的英文和中文模型,我们分析对话结果并定性标注攻击者策略、防御响应及失效模式。结果显示诈骗交互遵循重复的升级模式,而防御机制采用验证与延迟策略。此外,交互失效常源于安全护栏激活与角色不稳定性。我们的研究强调多轮交互安全是LLM行为中一个关键且独特的维度。