Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce \bench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. \bench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using \bench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at livenewsbench.com.