Large language models (LLMs) such as ChatGPT, Gemini, and LLaMA have attracted wide attention recently, demonstrating considerable progress and strong generalization across countless domains. However, LLMs form an even larger black box, exacerbating opacity, and only a few approaches to interpreting them exist. The uncertainty and opacity inherent in LLMs restrict their application in high-stakes domains such as financial fraud and phishing detection. Current approaches rely mainly on traditional text classification with post-hoc interpretable algorithms; they are vulnerable to attackers who can craft versatile adversarial samples to break the system's defenses, forcing users to trade off efficiency against robustness. To address this issue, we propose a novel cascading framework called Genshin (General Shield for Natural Language Processing with Large Language Models), which uses LLMs as defensive one-time plug-ins. Unlike most applications of LLMs, which transform text into something new or structured, Genshin uses LLMs to recover text to its original state. Genshin aims to combine the generalization ability of the LLM, the discriminative power of the median model, and the interpretability of the simple model. Our experiments on sentiment analysis and spam detection reveal fatal flaws in current median models and striking results for the LLM's recovery ability, demonstrating that Genshin is both effective and efficient. Our ablation study yields several intriguing observations. Using the LLM defender, a tool derived from the 4th paradigm of NLP, we reproduce BERT's 15% optimal mask rate in the 3rd paradigm. Additionally, when the LLM is employed as a potential adversarial tool, attackers can mount effective attacks that are nearly semantically lossless.
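The cascading idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm_restore` is a hypothetical stand-in for the LLM defender (the one-time plug-in that recovers perturbed text to its original state), and `median_classify` stands in for the discriminative median model.

```python
def llm_restore(text: str) -> str:
    """Hypothetical LLM defender: recover the original text from a
    possibly adversarial input. A real system would prompt an LLM here;
    we simulate only a trivial character-substitution repair."""
    return text.replace("0", "o").replace("3", "e")


def median_classify(text: str) -> str:
    """Hypothetical median model: a simple discriminative classifier."""
    return "spam" if "free money" in text.lower() else "ham"


def genshin_pipeline(text: str) -> str:
    # Cascade: first restore the text, then classify the recovered version.
    return median_classify(llm_restore(text))


# An adversarial sample with character-level perturbations ("Fr3e m0ney")
# would evade the bare classifier, but the restored text is caught.
print(genshin_pipeline("Fr3e m0ney now!!!"))  # prints "spam"
```

The key design point is that the defender is purely a preprocessing plug-in: the downstream classifier and any posterior interpretability tooling remain unchanged.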