Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, existing systems remain largely autonomous, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, yet current benchmarks neither model dynamic user feedback nor measure interaction costs. To address this gap, we introduce IDRBench, the first Interactive Deep Research Benchmark for systematically evaluating the interactive capabilities of deep research agents. IDRBench formulates deep research as an interactive process where agents may solicit clarification to better align with user intent. It integrates a modular interactive framework, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures alignment gains and interaction overhead. Experiments on seven representative proprietary and open-weight LLMs show that interaction consistently improves research quality and robustness, while revealing substantial differences in interaction efficiency across models. These findings establish interactive capability as a distinct evaluation dimension and position IDRBench as a reusable benchmark for future user-aligned deep research agents.