Identifying relevant research concepts is crucial for effective scientific search. However, mainstream sparse retrieval methods often lack concept-aware representations. To address this, we propose CASPER, a sparse retrieval model for scientific search that uses both tokens and keyphrases as representation units (i.e., dimensions in the sparse embedding space). This enables CASPER to represent queries and documents via research concepts and to match them at both the fine-grained token level and the conceptual level. Furthermore, we construct training data by leveraging abundant scholarly references (including titles, citation contexts, author-assigned keyphrases, and co-citations), which capture how research concepts are expressed in diverse settings. Empirically, CASPER outperforms strong dense and sparse retrieval baselines across eight scientific retrieval benchmarks. We also explore the effectiveness-efficiency trade-off via representation pruning and demonstrate CASPER's interpretability by showing that it can serve as an effective and efficient keyphrase generation model.
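To make the idea of a sparse space with both token and keyphrase dimensions concrete, here is a minimal sketch. It is not CASPER's implementation: the `kp:` prefix for keyphrase dimensions, the helper names (`score`, `prune`, `top_keyphrases`), and all weights are illustrative assumptions. It shows a dot-product match over shared dimensions, top-k pruning of a representation, and reading the highest-weighted keyphrase dimensions back out as keyphrases.

```python
from typing import Dict, List

# A sparse vector maps a dimension name to a non-negative weight.
# Assumption: keyphrase dimensions carry a "kp:" prefix; plain names are tokens.
SparseVec = Dict[str, float]


def score(query: SparseVec, doc: SparseVec) -> float:
    """Dot product over the dimensions shared by query and document."""
    small, large = (query, doc) if len(query) <= len(doc) else (doc, query)
    return sum(w * large[dim] for dim, w in small.items() if dim in large)


def prune(vec: SparseVec, k: int) -> SparseVec:
    """Keep only the k highest-weighted dimensions (smaller index, possibly lower quality)."""
    top = sorted(vec.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)


def top_keyphrases(vec: SparseVec, k: int) -> List[str]:
    """Read the k highest-weighted keyphrase dimensions back as keyphrases."""
    phrases = [(dim[3:], w) for dim, w in vec.items() if dim.startswith("kp:")]
    phrases.sort(key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in phrases[:k]]


# Hypothetical weights, chosen only for illustration.
query: SparseVec = {"sparse": 1.0, "retrieval": 0.5, "kp:sparse retrieval": 2.0}
doc: SparseVec = {
    "retrieval": 1.0,
    "scientific": 0.5,
    "kp:sparse retrieval": 1.5,
    "kp:keyphrase generation": 0.25,
}

full_score = score(query, doc)              # token match + keyphrase match
pruned_score = score(query, prune(doc, 1))  # only the strongest dimension survives
doc_keyphrases = top_keyphrases(doc, 2)
```

In this toy example the query and document match on one token dimension (`retrieval`) and one keyphrase dimension (`kp:sparse retrieval`). Pruning the document to a single dimension drops the token match and lowers the score, which mirrors the effectiveness-efficiency trade-off mentioned in the abstract.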