Scams exploiting real-time social engineering -- such as phishing, impersonation, and phone fraud -- remain a persistent and evolving threat across digital platforms. Existing defenses are largely reactive, offering limited protection during active interactions. We propose a privacy-preserving, AI-in-the-loop framework that proactively detects and disrupts scam conversations in real time. The system combines instruction-tuned AI with a safety-aware utility function that balances engagement against harm minimization, and employs federated learning to enable continual model updates without sharing raw data. Experimental evaluations show that the system produces fluent and engaging responses (perplexity as low as 22.3, engagement $\approx$0.80), while human studies confirm significant gains in realism, safety, and effectiveness over strong baselines. In federated settings, models trained with FedAvg remain stable for up to 30 rounds while preserving high engagement ($\approx$0.80), strong relevance ($\approx$0.74), and low PII leakage ($\leq$0.0085). Even with differential privacy, novelty and safety remain stable, indicating that robust privacy can be achieved without sacrificing performance. Evaluating guard models (LlamaGuard, LlamaGuard2/3, MD-Judge) reveals a clear trade-off: stricter moderation settings reduce the risk of exposing personal information but limit conversational engagement, whereas more relaxed settings permit longer, richer interactions that improve scam detection at the cost of higher privacy risk. To our knowledge, this is the first framework to unify real-time scam-baiting, federated privacy preservation, and calibrated safety moderation into a proactive defense paradigm.
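To make the two core mechanisms concrete, the sketch below illustrates (i) a safety-aware utility of the kind described, scoring candidate replies by rewarding engagement while penalizing expected harm, and (ii) FedAvg aggregation, which averages client model parameters weighted by local dataset size so raw conversation data never leaves each client. The function names, the linear form of the utility, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Dict, List


def safety_aware_utility(engagement: float, harm: float,
                         alpha: float = 1.0, beta: float = 2.0) -> float:
    """Score a candidate reply: reward engagement, penalize expected harm.

    alpha and beta are hypothetical weights; setting beta > alpha encodes
    a safety-first preference, so a risky-but-engaging reply can score
    below a safe, moderately engaging one.
    """
    return alpha * engagement - beta * harm


def fedavg(client_weights: List[Dict[str, float]],
           client_sizes: List[int]) -> Dict[str, float]:
    """One round of FedAvg: average each parameter across clients,
    weighted by local dataset size. Only model parameters are shared,
    never the underlying scam conversations."""
    total = sum(client_sizes)
    keys = client_weights[0].keys()
    return {k: sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
            for k in keys}


# Illustrative comparison: a safe reply outranks a riskier one.
u_safe = safety_aware_utility(engagement=0.8, harm=0.1)   # 0.8 - 0.2 = 0.6
u_risky = safety_aware_utility(engagement=0.9, harm=0.5)  # 0.9 - 1.0 = -0.1

# Illustrative aggregation: the larger client pulls the average toward it.
global_model = fedavg([{"w": 1.0}, {"w": 3.0}], client_sizes=[1, 3])
```

In a real deployment, `engagement` and `harm` would come from learned estimators and the parameter dictionaries from actual model state, but the weighting logic is the same.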