Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing what they find into comprehensive responses. Prior benchmarks for SA evaluation mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios, and their reliance on coarse task-level rubrics often limits the interpretability of the evaluation. To bridge these gaps, we introduce DailyReport, an open-ended benchmark for evaluating SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are publicly available at https://github.com/AGI-Eval-Official/DailyReport.
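The abstract does not spell out how cascade rubrics, performance attribution, or user-centric aggregation are computed, so the following is only an illustrative sketch under stated assumptions, not the DailyReport implementation. It assumes that a rubric may depend on a prerequisite rubric and earns credit only if every rubric above it in the cascade also passed, that a dimension score is the fraction of credited rubrics in that dimension, and that the user preference score is a weighted mean of dimension scores. The names `Rubric`, `effective_pass`, `dimension_scores`, `user_preference`, the dimension labels, and the weights are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rubric:
    dimension: str                # hypothetical dimension label
    passed: bool                  # verdict for this rubric taken in isolation
    parent: Optional[int] = None  # index of the prerequisite rubric, if any


def effective_pass(rubrics: List[Rubric], i: Optional[int]) -> bool:
    """Assumed cascade rule: credit only if the rubric and all ancestors passed."""
    while i is not None:
        if not rubrics[i].passed:
            return False
        i = rubrics[i].parent
    return True


def dimension_scores(rubrics: List[Rubric]) -> Dict[str, float]:
    """Fraction of credited rubrics per dimension (assumed scoring rule)."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for i, rubric in enumerate(rubrics):
        totals[rubric.dimension] += 1
        hits[rubric.dimension] += int(effective_pass(rubrics, i))
    return {dim: hits[dim] / totals[dim] for dim in totals}


def user_preference(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of dimension scores (assumed aggregation rule)."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


# Toy example: the last rubric passes on its own but loses credit
# because its prerequisite (index 2) failed.
rubrics = [
    Rubric("retrieval", True),
    Rubric("accuracy", True, parent=0),
    Rubric("retrieval", False),
    Rubric("accuracy", True, parent=2),
]
scores = dimension_scores(rubrics)
preference = user_preference(scores, {"retrieval": 1.0, "accuracy": 3.0})
```

In this toy setup, the failure of the fourth rubric is attributed to the upstream retrieval rubric it depends on rather than to the accuracy check itself, which is one plausible reading of what cascade attribution is meant to make visible.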