Raw datasets are often too large and unstructured to work with directly, and therefore require a data preparation phase. The domain of industrial Cyber-Physical Systems (CPSs) is no exception: raw data typically consists of large time-series collections that log the system's status at regular time intervals. Such raw data is often processed with ad hoc, case-specific, one-off Python scripts that tend to neglect readability, reusability, and maintainability. In practice, this leads professionals such as data scientists to write similar data preparation scripts for each case, resulting in much repetitive work. We introduce CPSLint, a Domain-Specific Language (DSL) designed to support the data preparation process for industrial CPSs. CPSLint raises the level of abstraction to the point where both data scientists and domain experts can perform the data preparation task. We leverage the fact that many raw data collections in the industrial CPS domain require similar actions to render them suitable for data-centric workflows. In our DSL, the data preparation process can be expressed in just a few lines of code. CPSLint is a publicly available tool applicable to any case involving time-series data collections in need of sanitisation.