SeDa: A Unified System for Dataset Discovery and Multi-Entity Augmented Semantic Exploration

The continuous expansion of open data platforms and research repositories has led to a fragmented dataset ecosystem, posing significant challenges for cross-source data discovery and interpretation. To address these challenges, we introduce SeDa, a unified framework for dataset discovery, semantic annotation, and multi-entity augmented navigation. SeDa integrates more than 7.6 million datasets from over 200 platforms, spanning governmental, academic, and industrial domains. The framework first performs semantic extraction and standardization to harmonize heterogeneous metadata representations. On this basis, a topic-tagging mechanism constructs an extensible tag graph that supports thematic retrieval and cross-domain association, while a provenance assurance module embedded within the annotation process continuously validates dataset sources and monitors link availability to ensure reliability and traceability. Furthermore, SeDa employs a multi-entity augmented navigation strategy that organizes datasets within a knowledge space of sites, institutions, and enterprises, enabling contextual and provenance-aware exploration beyond traditional search paradigms. Comparative experiments against popular dataset search platforms, such as ChatPD and Google Dataset Search, demonstrate that SeDa achieves superior coverage, timeliness, and traceability. Taken together, SeDa establishes a foundation for trustworthy, semantically enriched, and globally scalable dataset exploration.
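To make the first two stages concrete, the following is a minimal sketch of metadata harmonization followed by tag-graph construction. It is not SeDa's implementation: the field aliases, the common schema (`title`, `url`, `tags`), the function names, and the use of tag co-occurrence counts as graph edges are all assumptions chosen for illustration, and the sample records are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical aliases: each source platform may name the same field differently.
FIELD_ALIASES = {
    "title": ("title", "name", "dataset_name"),
    "url": ("url", "landing_page", "link"),
    "tags": ("tags", "keywords", "subjects"),
}

def harmonize(record):
    """Map a source-specific metadata record onto the assumed common schema."""
    unified = {}
    for field, aliases in FIELD_ALIASES.items():
        # Take the first alias present in the record, or None if none match.
        unified[field] = next((record[a] for a in aliases if a in record), None)
    # Normalize tags to a sorted list of unique, lowercase, stripped strings.
    raw_tags = unified["tags"] or []
    if isinstance(raw_tags, str):
        raw_tags = raw_tags.split(",")
    unified["tags"] = sorted({t.strip().lower() for t in raw_tags if t.strip()})
    return unified

def build_tag_graph(records):
    """Count how often each pair of tags co-occurs on the same dataset."""
    graph = defaultdict(int)
    for rec in records:
        for a, b in combinations(rec["tags"], 2):
            graph[(a, b)] += 1
    return dict(graph)

# Invented sample records from two imaginary platforms.
raw = [
    {"name": "City Air Quality", "landing_page": "https://example.org/a",
     "keywords": "Air, Health"},
    {"title": "Hospital Visits", "url": "https://example.org/b",
     "tags": ["health", "air"]},
]
records = [harmonize(r) for r in raw]
tag_graph = build_tag_graph(records)
```

Because tags are sorted before pairing, each edge has a single canonical key, so the two sample records contribute to the same `("air", "health")` edge even though their sources spell and order the tags differently.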

