The continuous expansion of task-specific datasets has become a major driver of progress in machine learning. However, discovering newly released datasets remains difficult: existing platforms largely depend on manual curation or community submissions, which leads to limited coverage and substantial delays. To address this challenge, we introduce AutoDataset, a lightweight, automated system for real-time dataset discovery and retrieval. AutoDataset adopts a paper-first approach, continuously monitoring arXiv to detect and index datasets directly from newly published research. The system operates through a low-overhead, multi-stage pipeline. First, a lightweight classifier filters titles and abstracts to identify papers that release datasets, achieving an F1 score of 0.94 with an inference latency of 11 ms. For the identified papers, we parse the PDFs with GROBID and apply a sentence-level extractor to obtain dataset descriptions. Dataset URLs are extracted from the paper text, with an automated fallback to LaTeX source analysis when the text yields none. Finally, the structured records are indexed with a dense semantic retriever, enabling low-latency natural language search. We deploy AutoDataset as a live system that continuously ingests new papers and provides up-to-date dataset discovery. In practice, AutoDataset substantially reduces the time researchers need to locate newly released datasets, improving dataset discovery efficiency by up to 80%.
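To illustrate the URL extraction stage with its LaTeX fallback, the following is a minimal Python sketch and not the system's actual implementation. The regular expressions, the list of hosting domains in `DATASET_HOSTS`, and the function name `extract_dataset_urls` are all assumptions introduced here for illustration; the abstract does not specify how URLs are matched or filtered.

```python
import re
from typing import List, Optional

# Assumed pattern for URLs in running text (not taken from the paper).
URL_RE = re.compile(r"https?://[^\s{}<>]+")
# Assumed pattern for \url{...} and \href{...}{...} commands in LaTeX source.
LATEX_URL_RE = re.compile(r"\\(?:url|href)\{([^{}]+)\}")

# Hypothetical list of dataset hosting domains, for illustration only.
DATASET_HOSTS = ("github.com", "huggingface.co", "zenodo.org", "kaggle.com")


def _clean(url: str) -> str:
    """Strip trailing punctuation that often sticks to URLs in prose."""
    return url.rstrip(".,;:)]")


def _dataset_urls(candidates: List[str]) -> List[str]:
    """Keep de-duplicated URLs whose host looks like a dataset repository."""
    seen, kept = set(), []
    for url in map(_clean, candidates):
        if any(host in url for host in DATASET_HOSTS) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def extract_dataset_urls(paper_text: str,
                         latex_source: Optional[str] = None) -> List[str]:
    """Search the parsed paper text first; use the LaTeX source only if
    the text yields no dataset URL and a source file is available."""
    urls = _dataset_urls(URL_RE.findall(paper_text))
    if urls or latex_source is None:
        return urls
    candidates = LATEX_URL_RE.findall(latex_source) + URL_RE.findall(latex_source)
    return _dataset_urls(candidates)
```

In this sketch the LaTeX source is consulted only when the parsed text yields no candidate, which mirrors the fallback order described above; a production system would likely need more robust URL validation than a substring match on the host.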