Large Language Models (LLMs) have been augmented with web search to overcome the limitations of a static knowledge boundary by accessing up-to-date information from the open Internet. While this integration enhances model capability, it also introduces a distinct safety threat surface: the retrieval and citation process risks exposing users to harmful or low-credibility web content. Existing red-teaming methods are largely designed for standalone LLMs: they focus primarily on unsafe generation and overlook risks that emerge from the complex search workflow. To address this gap, we propose CREST-Search, a pioneering red-teaming framework for LLMs with web search. The cornerstone of CREST-Search is three novel attack strategies that generate seemingly benign search queries yet induce unsafe citations. It also employs an iterative in-context refinement mechanism to strengthen adversarial effectiveness under black-box constraints. In addition, we construct a search-specific harmful dataset, WebSearch-Harm, which enables fine-tuning a specialized red-teaming model to improve query quality. Our experiments demonstrate that CREST-Search effectively bypasses safety filters and systematically exposes vulnerabilities in web-search-based LLM systems, underscoring the need to develop robust search models.
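The iterative in-context refinement mechanism mentioned above can be pictured as a black-box loop: the attacker observes only the target's responses and rewrites the query using the history of past attempts. The following is a minimal illustrative sketch, not the paper's actual implementation; all function names (`query_target`, `is_unsafe_citation`, `rewrite_query`) are hypothetical placeholders for the attacker-side and target-side components.

```python
def refine_attack(seed_query, query_target, is_unsafe_citation,
                  rewrite_query, max_rounds=5):
    """Black-box refinement loop (illustrative sketch).

    Repeatedly queries the target system and, if the response does not
    yet cite unsafe content, asks the red-teaming model to rewrite the
    query in context of all previous (query, response) attempts.
    """
    query, history = seed_query, []
    for _ in range(max_rounds):
        response = query_target(query)        # only input/output access (black box)
        if is_unsafe_citation(response):
            return query, response            # attack succeeded
        history.append((query, response))
        query = rewrite_query(query, history) # in-context refinement from past attempts
    return None, None                         # budget exhausted, attack failed
```

In practice, `rewrite_query` would be the fine-tuned red-teaming model conditioning on the attempt history, and `is_unsafe_citation` a judge over the cited sources; here they are left abstract since the abstract does not specify their interfaces.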