The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery: identifying the datasets relevant to a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce \textbf{ReSearch}, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural-language intent with the datasets cited in peer-reviewed Earth Science studies. Experiments show that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries that express abstract scientific goals. These results underscore the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.
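The separation of recall and precision objectives described in the abstract can be illustrated with a small sketch. This is not the ReSearch implementation: the catalog, the abbreviation table, and every function name below are hypothetical. A bag-of-words cosine stands in for the semantic embedding model, and a pluggable `judge` callable stands in for the large language model reranker.

```python
import math
import re
from collections import Counter

# Hypothetical abbreviation table and toy catalog, for illustration only.
ABBREVIATIONS = {
    "sst": "sea surface temperature",
    "aod": "aerosol optical depth",
}
CATALOG = {
    "ds_sst": "Daily sea surface temperature analysis from satellite observations",
    "ds_aod": "Aerosol optical depth retrievals from satellite radiometer",
    "ds_precip": "Monthly precipitation reanalysis product",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_abbreviations(query):
    """Append the long form of every known abbreviation found in the query."""
    extra = [ABBREVIATIONS[t] for t in tokenize(query) if t in ABBREVIATIONS]
    return " ".join([query] + extra)


def lexical_score(query, doc):
    """Number of distinct query tokens that also occur in the document."""
    return len(set(tokenize(query)) & set(tokenize(doc)))


def cosine_score(query, doc):
    """Bag-of-words cosine similarity (assumed stand-in for learned embeddings)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, catalog, k=2):
    """Recall stage: union of the top-k candidates from each retriever."""
    expanded = expand_abbreviations(query)
    candidates = set()
    for scorer in (lexical_score, cosine_score):
        ranked = sorted(
            catalog, key=lambda i: scorer(expanded, catalog[i]), reverse=True
        )
        candidates.update(ranked[:k])
    return candidates


def rerank(query, candidates, catalog, judge):
    """Precision stage: order candidates by a judge (an LLM in the paper's setting)."""
    # Sorting the ids first makes tie-breaking deterministic.
    return sorted(
        sorted(candidates), key=lambda i: judge(query, catalog[i]), reverse=True
    )


def search(query, catalog=CATALOG, judge=None, k=2):
    """Run the recall stage, then the precision stage."""
    if judge is None:
        # Default judge reuses the cosine stand-in on the expanded query.
        judge = lambda q, d: cosine_score(expand_abbreviations(q), d)
    return rerank(query, retrieve(query, catalog, k), catalog, judge)
```

Calling `search("SST trends")` expands the abbreviation before retrieval, so the sea surface temperature entry is matched even though the query never spells the term out. Keeping `retrieve` and `rerank` as separate functions mirrors the idea that the first stage can favor coverage while the second stage is free to use a slower, more discerning scorer.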