Interactive Distillation of Large Single-Topic Corpora of Scientific Papers

Highly specific datasets of scientific literature are important for both research and education. However, it is difficult to build such datasets at scale. A common approach is to build these datasets reductively, by applying topic modeling to an established corpus and selecting specific topics. A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert (SME) handpicking documents. This method does not scale and becomes prone to error as the dataset grows. Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature. Given a small initial "core" corpus of papers, we build a citation network of documents. At each step of the citation network, we generate text embeddings and visualize them through dimensionality reduction. Papers that are "similar" to the core are kept in the dataset; the rest are pruned through human-in-the-loop selection. Additional insight into the papers is gained through sub-topic modeling using SeNMFk. We demonstrate our new tool for literature review by applying it to two different fields in machine learning.
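The expand-and-prune loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a real text embedding, a fixed cosine-similarity `threshold` stands in for the human-in-the-loop selection, and the visualization and SeNMFk steps are omitted. The function name `expand_corpus`, its parameters, and the sample papers are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real text embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_corpus(core_ids, texts, citations, hops=1, threshold=0.3):
    """Grow a corpus outward from the core along citation links.

    At each hop, papers cited by the current frontier are candidates.
    A candidate is kept if it is similar enough to the core centroid;
    otherwise it is pruned (assumption: a threshold replaces the human reviewer).
    """
    kept = set(core_ids)
    centroid = Counter()
    for pid in core_ids:
        centroid.update(embed(texts[pid]))

    frontier = set(core_ids)
    for _ in range(hops):
        candidates = {c for pid in frontier for c in citations.get(pid, ())} - kept
        frontier = {
            pid for pid in candidates
            if cosine(embed(texts[pid]), centroid) >= threshold
        }
        kept |= frontier
    return kept


# Hypothetical sample data: paper id -> abstract text, paper id -> cited paper ids.
texts = {
    "core1": "tensor decomposition for topic modeling",
    "core2": "nonnegative matrix factorization topic modeling",
    "a": "tensor factorization methods for topic extraction",
    "b": "protein folding in yeast cells",
    "c": "matrix factorization for topic modeling of text",
}
citations = {"core1": ["a", "b"], "core2": ["b"], "a": ["c"]}

one_hop = expand_corpus(["core1", "core2"], texts, citations, hops=1)
two_hops = expand_corpus(["core1", "core2"], texts, citations, hops=2)
print(sorted(one_hop))
print(sorted(two_hops))
```

In this sketch the off-topic paper `b` is pruned at the first hop, while `c` only becomes reachable at the second hop because it is cited by `a`, a paper kept in the first hop. Comparing against a fixed core centroid is one possible design choice; a real tool might instead let the reviewer inspect a 2-D projection of the embeddings and select papers by hand.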

