Hallucination in large language models (LLMs) remains an acute concern, contributing to the spread of misinformation and eroding public trust, particularly in high-risk domains. Among hallucination types, factuality is crucial, as it concerns a model's alignment with established world knowledge. Adversarial factuality, defined as the deliberate insertion of misinformation into prompts with varying levels of expressed confidence, tests a model's ability to detect and resist confidently framed falsehoods. Existing work lacks high-quality, domain-specific resources for assessing model robustness under such adversarial conditions, and no prior research has examined the impact of injected misinformation on long-form text factuality. To address this gap, we introduce AdversaRiskQA, the first verified and reliable benchmark for systematically evaluating adversarial factuality across the Health, Finance, and Law domains. The benchmark includes two difficulty levels to test LLMs' defensive capabilities across varying depths of knowledge. We propose two automated methods for evaluating adversarial attack success and long-form factuality. Using these methods, we evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring their misinformation detection rates; long-form factuality is assessed on Qwen3 (30B) under both baseline and adversarial conditions. Results show that, after excluding meaningless responses, Qwen3 (80B) achieves the highest average accuracy, while GPT-5 maintains consistently high accuracy. Performance scales non-linearly with model size, varies by domain, and the gap between difficulty levels narrows as models grow. Long-form evaluation reveals no significant correlation between injected misinformation and the factuality of the model's output. AdversaRiskQA provides a valuable benchmark for pinpointing LLM weaknesses and developing more reliable models for high-stakes applications.