The continuous expansion of open data platforms and research repositories has fragmented the dataset ecosystem, making cross-source data discovery and interpretation difficult. To address these challenges, we introduce SeDa, a unified framework for dataset discovery, semantic annotation, and multi-entity augmented navigation. SeDa integrates more than 7.6 million datasets from over 200 platforms spanning governmental, academic, and industrial domains. The framework first performs semantic extraction and standardization to harmonize heterogeneous metadata representations. On this basis, a topic-tagging mechanism constructs an extensible tag graph that supports thematic retrieval and cross-domain association, while a provenance assurance module embedded in the annotation process continuously validates dataset sources and monitors link availability to ensure reliability and traceability. Furthermore, SeDa employs a multi-entity augmented navigation strategy that organizes datasets within a knowledge space of sites, institutions, and enterprises, enabling contextual and provenance-aware exploration beyond traditional search paradigms. Comparative experiments against popular dataset search platforms, such as ChatPD and Google Dataset Search, show that SeDa achieves superior coverage, timeliness, and traceability. Taken together, SeDa establishes a foundation for trustworthy, semantically enriched, and globally scalable dataset exploration.
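As a rough illustration of the ideas named in the abstract (metadata standardization, a tag graph, and grouping datasets by entity), the following is a minimal sketch and not SeDa's actual implementation. The source names `portal_a` and `portal_b`, the field mappings, the shared schema, the sample records, and the choice of tag co-occurrence as the graph's edge weight are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mappings from two source schemas to one shared schema.
FIELD_MAPS = {
    "portal_a": {"title": "title", "keywords": "tags",
                 "publisher": "institution", "landing_page": "url"},
    "portal_b": {"name": "title", "subjects": "tags",
                 "org": "institution", "link": "url"},
}


def standardize(record, source):
    """Rename source-specific fields to the shared schema and normalize tags."""
    mapping = FIELD_MAPS[source]
    out = {target: record[field]
           for field, target in mapping.items() if field in record}
    # Normalize tags: trim whitespace, lowercase, deduplicate, sort.
    out["tags"] = sorted({t.strip().lower() for t in out.get("tags", [])})
    out["site"] = source  # keep the originating site for provenance
    return out


def build_tag_graph(datasets):
    """Undirected graph whose edge weight counts datasets where two tags co-occur.

    Co-occurrence is an assumed weighting, chosen only to keep the sketch small.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for d in datasets:
        for a, b in combinations(d["tags"], 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def index_by_entity(datasets, entity_field):
    """Group dataset titles by an entity field such as 'site' or 'institution'."""
    index = defaultdict(list)
    for d in datasets:
        if entity_field in d:
            index[d[entity_field]].append(d.get("title"))
    return index


# Invented sample records in two different source formats.
raw = [
    ("portal_a", {"title": "City Air Quality 2023",
                  "keywords": ["Air Quality", "Environment "],
                  "publisher": "City Env Agency",
                  "landing_page": "https://example.org/a/1"}),
    ("portal_b", {"name": "Regional Emissions",
                  "subjects": ["environment", "Emissions", "air quality"],
                  "org": "City Env Agency",
                  "link": "https://example.org/b/7"}),
]

datasets = [standardize(rec, src) for src, rec in raw]
tag_graph = build_tag_graph(datasets)
by_institution = index_by_entity(datasets, "institution")
```

In this sketch, both records end up under the same institution key even though their source fields differ (`publisher` versus `org`), and tags that differ only in case or whitespace collapse to one node in the graph. A provenance check such as link availability monitoring would need network access and is left out here.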