Detection and Defense Against Prominent Attacks on Preconditioned LLM-Integrated Virtual Assistants

from arxiv, Accepted to be published in the Proceedings of the 10th IEEE CSDE 2023, the Asia-Pacific Conference on Computer Science and Data Engineering 2023

The emergence of LLM (Large Language Model) integrated virtual assistants has brought about a rapid transformation in communication dynamics. During virtual assistant development, some developers prefer to leverage the system message, also known as an initial prompt or custom prompt, for preconditioning purposes. However, it is important to recognize that an excessive reliance on this functionality raises the risk of manipulation by malicious actors who can exploit it with carefully crafted prompts. Such malicious manipulation poses a significant threat, potentially compromising the accuracy and reliability of the virtual assistant's responses. Consequently, safeguarding the virtual assistants with detection and defense mechanisms becomes of paramount importance to ensure their safety and integrity. In this study, we explored three detection and defense mechanisms aimed at countering attacks that target the system message. These mechanisms include inserting a reference key, utilizing an LLM evaluator, and implementing a Self-Reminder. To showcase the efficacy of these mechanisms, they were tested against prominent attack techniques. Our findings demonstrate that the investigated mechanisms are capable of accurately identifying and counteracting the attacks. The effectiveness of these mechanisms underscores their potential in safeguarding the integrity and reliability of virtual assistants, reinforcing the importance of their implementation in real-world scenarios. By prioritizing the security of virtual assistants, organizations can maintain user trust, preserve the integrity of the application, and uphold the high standards expected in this era of transformative technologies.

翻译：大语言模型（LLM）集成虚拟助手的出现，迅速改变了通信动态。在虚拟助手开发过程中，部分开发者倾向于利用系统消息（也称为初始提示或自定义提示）进行预条件化。然而，必须认识到，过度依赖此功能会增加被恶意行为者利用的风险，他们可通过精心构造的提示对其进行操控。此类恶意操控构成重大威胁，可能损害虚拟助手响应的准确性与可靠性。因此，为虚拟助手配备检测与防御机制，以确保其安全性与完整性至关重要。在本研究中，我们探索了三种针对系统消息攻击的检测与防御机制，包括插入参考密钥、利用LLM评估器以及实施自我提醒。为展示这些机制的有效性，我们针对显著攻击技术进行了测试。研究结果表明，所研究的机制能够准确识别并抵御攻击。这些机制的有效性凸显了其在保障虚拟助手完整性与可靠性方面的潜力，强调了在现实场景中实施这些机制的重要性。通过优先考虑虚拟助手的安全性，组织可维护用户信任，保障应用完整性，并实现在这一变革性技术时代所期望的高标准。