We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools, such as o3 and o4-mini, achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.