Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on closed-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench (PDR-Bench), the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures Personalization Alignment, Content Quality, and Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.