Large language models (LLMs) have been widely integrated into critical automated workflows, including contract review and job application screening. However, LLMs are susceptible to manipulation by fraudulent information, which can lead to harmful outcomes. Although advanced defense methods have been developed to address this issue, they often exhibit limitations in effectiveness, interpretability, and generalizability, particularly when applied to LLM-based applications. To address these challenges, we introduce FraudShield, a novel framework that protects LLMs from fraudulent content by leveraging a comprehensive analysis of fraud tactics. Specifically, FraudShield constructs and refines a fraud tactic-keyword knowledge graph that captures high-confidence associations between suspicious text and fraud techniques. This structured knowledge graph augments the original input by highlighting keywords and providing supporting evidence, guiding the LLM toward more secure responses. Extensive experiments show that FraudShield consistently outperforms state-of-the-art defenses across four mainstream LLMs and five representative fraud types, while also providing interpretable evidence for the model's outputs.