Raw datasets are often too large and unstructured to work with directly and therefore require a data preparation phase. The domain of industrial Cyber-Physical Systems (CPSs) is no exception: raw data typically consists of large time-series collections that log the system's status at regular time intervals. Such raw data is commonly processed with ad hoc, case-specific, one-off Python scripts that often neglect readability, reusability, and maintainability. In practice, professionals such as data scientists end up writing similar data preparation scripts for each case, which amounts to much repetitive work. We introduce CPSLint, a Domain-Specific Language (DSL) designed to support the data preparation process for industrial CPSs. CPSLint raises the level of abstraction to the point where both data scientists and domain experts can perform the data preparation task. We leverage the fact that many raw data collections in the industrial CPS domain require similar actions to render them suitable for data-centric workflows. In our DSL, one can express the data preparation process in just a few lines of code. CPSLint is a publicly available tool applicable to any case involving time-series data collections in need of sanitisation.