The increasing depth of parametric domain knowledge in large language models (LLMs) is fueling their rapid deployment in real-world applications. Understanding model vulnerabilities in high-stakes, knowledge-intensive tasks is essential for quantifying the trustworthiness of model predictions and regulating their use. The recent discovery of named entities as adversarial examples (i.e., adversarial entities) in natural language processing tasks raises questions about their potential impact on the knowledge robustness of pre-trained and fine-tuned LLMs in high-stakes and specialized domains. We examined the use of type-consistent entity substitution as a template for collecting adversarial entities for billion-parameter LLMs with biomedical knowledge. To this end, we developed an embedding-space attack based on power-scaled distance-weighted sampling to assess the robustness of their biomedical knowledge with a low query budget and controllable coverage. Our method achieves better query efficiency and scaling than alternatives based on random sampling and black-box gradient-guided search, which we demonstrated for adversarial distractor generation in biomedical question answering. Subsequent failure-mode analysis uncovered two regimes of adversarial entities on the attack surface with distinct characteristics, and we showed that entity substitution attacks can manipulate token-wise Shapley value explanations, which become deceptive in this setting. Our approach complements standard evaluations for high-capacity models, and the results highlight the brittleness of domain knowledge in LLMs.
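The core sampling idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the inverse-power parameterization `w = d**-alpha` (favoring entities near the query in embedding space), and the exponent `alpha` are all assumptions introduced here to show one plausible form of power-scaled distance-weighted sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_scaled_distance_weighted_sample(query, embeddings, alpha=2.0, k=5):
    """Sample k candidate entity indices without replacement, with selection
    probability weighted by embedding distance to the query raised to a power.

    Hypothetical sketch: the exponent alpha trades off coverage (small alpha,
    near-uniform sampling) against concentration on nearby, type-consistent
    entities (large alpha). The sign and scale of the exponent are assumptions.
    """
    d = np.linalg.norm(embeddings - query, axis=1)  # Euclidean distances
    d = np.maximum(d, 1e-12)                        # guard against zero distance
    w = d ** -alpha                                 # closer entities weighted more
    p = w / w.sum()                                 # normalize to a distribution
    return rng.choice(len(embeddings), size=k, replace=False, p=p)

# Toy example: 100 candidate entity embeddings in 8 dimensions.
emb = rng.normal(size=(100, 8))
query = rng.normal(size=8)
idx = power_scaled_distance_weighted_sample(query, emb, alpha=2.0, k=5)
```

Sampling without replacement keeps the query budget fixed at `k` model calls per probe, while `alpha` gives the controllable coverage mentioned above.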