This manuscript provides a systemic, data-centric view of what we term essential data science: a natural ecosystem whose challenges and missions stem from the fusion of the data universe, with its multiple combinations of the 5D complexities (data structure, domain, cardinality, causality, and ethics), with the phases of the data life cycle. Data agents perform tasks driven by specific goals. The data scientist is an abstract entity that emerges from the logical organization of data agents and their actions. Data scientists face challenges that are defined according to the missions. We define specific discipline-induced data science, which in turn allows for the definition of pan-data science, a natural ecosystem that integrates specific disciplines with essential data science. We semantically split essential data science into computational and foundational components. By formalizing this ecosystemic view, we contribute a general-purpose, fusion-oriented architecture for integrating heterogeneous knowledge, agents, and workflows, relevant to a wide range of disciplines and high-impact applications.