AutoDataset: A Lightweight System for Continuous Dataset Discovery and Search

The continuous expansion of task-specific datasets has become a major driver of progress in machine learning. However, discovering newly released datasets remains difficult: existing platforms largely depend on manual curation or community submissions, which limits coverage and introduces substantial delays. To address this challenge, we introduce AutoDataset, a lightweight, automated system for real-time dataset discovery and retrieval. AutoDataset adopts a paper-first approach, continuously monitoring arXiv to detect and index datasets directly from newly published research. The system operates through a low-overhead multi-stage pipeline. First, a lightweight classifier rapidly filters titles and abstracts to identify papers that release datasets, achieving an F1 score of 0.94 with an inference latency of 11 ms. For the identified papers, we parse PDFs with GROBID and apply a sentence-level extractor to obtain dataset descriptions. Dataset URLs are extracted from the paper text, with an automated fallback to LaTeX source analysis when needed. Finally, the structured records are indexed using a dense semantic retriever, enabling low-latency natural language search. We deploy AutoDataset as a live system that continuously ingests new papers and provides up-to-date dataset discovery. In practice, it has been shown to significantly reduce the time researchers need to locate newly released datasets, improving dataset discovery efficiency by up to 80%.
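
The URL extraction step with its LaTeX fallback can be illustrated with a minimal sketch. The abstract does not specify the extraction rules, so everything here is an assumption for illustration: the regular expressions, the function names (`urls_from_text`, `urls_from_latex`, `extract_dataset_urls`), and the choice to trigger the fallback only when the parsed text yields no URL at all.

```python
import re

# Hypothetical patterns; the actual rules used by AutoDataset are not given.
URL_RE = re.compile(r"https?://[^\s{}<>\"']+")
# LaTeX commands that commonly carry links: \url{...} and \href{...}{...}
LATEX_LINK_RE = re.compile(r"\\(?:url|href)\{([^{}]+)\}")

# Punctuation that often trails a URL at the end of a sentence.
TRAILING = ".,;:)]"


def _dedupe(candidates):
    """Strip trailing punctuation and drop duplicates, keeping first-seen order."""
    seen = []
    for candidate in candidates:
        url = candidate.strip().rstrip(TRAILING)
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.append(url)
    return seen


def urls_from_text(text):
    """Collect URLs that appear literally in the parsed paper text."""
    return _dedupe(URL_RE.findall(text))


def urls_from_latex(source):
    r"""Collect URLs from \url{...} and \href{...}{...} in LaTeX source."""
    return _dedupe(LATEX_LINK_RE.findall(source))


def extract_dataset_urls(paper_text, latex_source=None):
    """Try the paper text first; fall back to the LaTeX source if nothing is found."""
    urls = urls_from_text(paper_text)
    if urls or latex_source is None:
        return urls
    return urls_from_latex(latex_source)
```

A call such as `extract_dataset_urls(parsed_text, latex_source)` returns the URLs found in the parsed text when there are any, and otherwise those found in the LaTeX source. A real system would likely also need to handle escaped characters in LaTeX URLs and to filter out links that do not point to datasets; this sketch does neither.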

