DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions

Modern machine learning relies on datasets to develop and validate research ideas. Given the growth of publicly available data, finding the right dataset to use is increasingly difficult. Any research question imposes explicit and implicit constraints on how well a given dataset will enable researchers to answer this question, such as dataset size, modality, and domain. We introduce a new task of recommending relevant datasets given a short natural language description of a research idea, to help people find relevant datasets for their needs. Dataset recommendation poses unique challenges as an information retrieval problem; datasets are hard to directly index for search and there are no corpora readily available for this task. To operationalize this task, we build the DataFinder Dataset which consists of a larger automatically-constructed training set (17.5K queries) and a smaller expert-annotated evaluation set (392 queries). Using this data, we compare various information retrieval algorithms on our test set and present the first-ever published system for text-based dataset recommendation using machine learning techniques. This system, trained on the DataFinder Dataset, finds more relevant search results than existing third-party dataset search engines. To encourage progress on dataset recommendation, we release our dataset and models to the public.
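The task framing above, ranking candidate datasets against a short natural language description of a research idea, can be illustrated with a minimal retrieval sketch. This is not the system described in the abstract: it uses plain TF-IDF with cosine similarity as a stand-in baseline, and the catalogue contents, the dataset descriptions, and the function names (`build_index`, `recommend`) are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Hypothetical toy catalogue (an assumption for illustration only):
# dataset name -> short textual description used as the search document.
DATASETS = {
    "SQuAD": "reading comprehension question answering dataset over Wikipedia text passages",
    "COCO": "large image dataset for object detection segmentation and captioning",
    "LibriSpeech": "corpus of read English speech audio for automatic speech recognition",
}


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def tfidf(tokens, idf):
    """Map tokens to a sparse TF-IDF vector, ignoring out-of-vocabulary terms."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(docs):
    """Compute smoothed IDF weights and one TF-IDF vector per dataset description."""
    tokenized = {name: tokenize(desc) for name, desc in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = {name: tfidf(toks, idf) for name, toks in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query, idf, vectors, k=2):
    """Return the names of the k datasets most similar to the query text."""
    query_vec = tfidf(tokenize(query), idf)
    ranked = sorted(vectors, key=lambda name: cosine(query_vec, vectors[name]), reverse=True)
    return ranked[:k]


idf, vectors = build_index(DATASETS)
print(recommend("I want to train a question answering model on Wikipedia text", idf, vectors, k=1))
```

A lexical baseline like this only matches on shared words, so it cannot capture the implicit constraints the abstract mentions (size, modality, domain) unless they happen to appear in the description text. That gap is one plausible motivation for training a retrieval model on query and dataset pairs instead.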

