Modern machine learning relies on datasets to develop and validate research ideas. As the volume of publicly available data grows, finding the right dataset is increasingly difficult. Every research question imposes explicit and implicit constraints, such as dataset size, modality, and domain, that determine how well a given dataset lets researchers answer it. We introduce a new task: recommending relevant datasets given a short natural language description of a research idea. Dataset recommendation poses unique challenges as an information retrieval problem: datasets are hard to index directly for search, and no corpora are readily available for this task. To operationalize the task, we build the DataFinder Dataset, which consists of a larger automatically constructed training set (17.5K queries) and a smaller expert-annotated evaluation set (392 queries). Using this data, we compare various information retrieval algorithms on our test set and present the first-ever published system for text-based dataset recommendation using machine learning techniques. This system, trained on the DataFinder Dataset, finds more relevant search results than existing third-party dataset search engines. To encourage progress on dataset recommendation, we release our dataset and models to the public.
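To make the task concrete, the following is a minimal sketch of text-based dataset recommendation framed as retrieval: a short query is scored against one-line dataset descriptions using TF-IDF weights and cosine similarity. This is a generic lexical baseline chosen for illustration, not the system described above; the abstract does not name the algorithms compared. The catalog entries, their descriptions, and the function names (`build_index`, `recommend`) are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy catalog: the one-line descriptions are illustrative
# paraphrases written for this sketch, not entries from the DataFinder Dataset.
CATALOG = {
    "SQuAD": "reading comprehension question answering over wikipedia passages text",
    "ImageNet": "large scale image classification object recognition photos",
    "LibriSpeech": "speech recognition audio corpus read english audiobooks",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(catalog):
    """Return per-dataset term counts and smoothed IDF weights."""
    docs = {name: Counter(tokenize(desc)) for name, desc in catalog.items()}
    n_docs = len(docs)
    # Document frequency: number of descriptions containing each term.
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return docs, idf


def vectorize(counts, idf):
    """Map term counts to TF-IDF weights, dropping out-of-vocabulary terms."""
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query, catalog, k=2):
    """Return up to k dataset names with nonzero similarity, best first."""
    docs, idf = build_index(catalog)
    query_vec = vectorize(Counter(tokenize(query)), idf)
    scored = [
        (cosine(query_vec, vectorize(counts, idf)), name)
        for name, counts in docs.items()
    ]
    # Sort by descending score; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0.0]


if __name__ == "__main__":
    print(recommend("question answering over text passages", CATALOG))
    print(recommend("speech recognition from audio", CATALOG))
```

A purely lexical ranker of this kind only matches queries that share vocabulary with the dataset descriptions, so it cannot capture implicit constraints that the query never states in the same words.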