Raw datasets are often too large and unstructured to work with directly and require a data preparation process. The domain of industrial Cyber-Physical Systems (CPS) is no exception, as raw data typically consists of large amounts of time-series data logging the system's status at regular time intervals. Such data has to be sanity-checked and preprocessed to be consumable by data-centric workflows. We introduce CPSLint, a Domain-Specific Language designed to provide data preparation for industrial CPS. We build on the fact that many raw data collections in the CPS domain require similar actions to render them suitable for Machine-Learning (ML) solutions, e.g., Fault Detection and Identification (FDI) workflows, yet still vary too much to hope for one universally applicable solution. CPSLint's main features include type checking and enforcing constraints on data columns through validation and remediation, such as imputing missing data from surrounding rows. More advanced features cover the inference of extra CPS-specific data structures, both column-wise and row-wise. For instance, descriptive execution phases, a row-wise structure and an effective method of data compartmentalisation, are extracted and prepared for ML-assisted FDI workflows. We demonstrate CPSLint's features through a proof-of-concept implementation.