Large language models have advanced from single-turn question answering to deep research systems that iteratively decompose research questions, invoke retrieval tools, and synthesize information across multiple rounds. Evaluating such systems typically involves scoring their final research reports holistically, but this end-to-end paradigm tightly couples the language model's decision-making, workflow design, and environmental feedback, precluding decomposable analysis of individual components. We introduce ScholarGym, an evaluation environment that isolates the information-gathering stage of deep research on academic literature. Under a unified workflow, ScholarGym decomposes the research process into three explicit stages -- Query Planning, Tool Invocation, and Relevance Assessment -- and evaluates each against 2,536 expert-annotated queries over a static corpus of 570K papers with deterministic retrieval. Systematic experiments reveal that iterative query decomposition yields 2.9--3.3$\times$ F1 gains over single-query retrieval, that models with extended thinking trade recall for precision, and that Query Planning quality and Relevance Assessment accuracy constitute dual bottlenecks separating proprietary from open-source model performance.