Highly specific datasets of scientific literature are important for both research and education, but they are difficult to build at scale. A common approach is to build such datasets reductively, by applying topic modeling to an established corpus and selecting specific topics. A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert (SME) handpicking documents. This method does not scale and becomes prone to error as the dataset grows. Here we showcase a new machine-learning-based tool for constructively generating targeted datasets of scientific literature. Given a small initial "core" corpus of papers, we build a citation network of documents. At each step of the citation network, we generate text embeddings and visualize them through dimensionality reduction. Papers are kept in the dataset if they are "similar" to the core; otherwise they are pruned through human-in-the-loop selection. Additional insight into the papers is gained through sub-topic modeling using SeNMFk. We demonstrate the tool for literature review by applying it to two different fields in machine learning.
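The keep-or-prune step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how "similar" is measured, so everything here is an assumption made for illustration: cosine similarity against the centroid of the core embeddings, the threshold value of 0.8, the toy two-dimensional vectors, and the hypothetical helper names `cosine`, `centroid`, and `split_by_similarity`. It is not the tool's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def split_by_similarity(core, candidates, threshold=0.8):
    """Split candidate indices into (keep, review).

    Assumption: a candidate counts as "similar" when its cosine similarity
    to the centroid of the core embeddings is at least `threshold`.
    Indices in `review` would be passed to a human for a pruning decision.
    """
    center = centroid(core)
    keep, review = [], []
    for idx, vec in enumerate(candidates):
        if cosine(vec, center) >= threshold:
            keep.append(idx)
        else:
            review.append(idx)
    return keep, review


# Toy embeddings standing in for real text embeddings (hypothetical data).
core = [[1.0, 0.0], [0.8, 0.6]]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
keep, review = split_by_similarity(core, candidates, threshold=0.8)
```

With these toy vectors, only the first candidate clears the threshold, so `keep` is `[0]` and `review` is `[1, 2]`. In a real pipeline the embeddings would come from a text embedding model, and the threshold would likely be chosen by inspecting the low-dimensional visualization rather than fixed in advance.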