The proliferation of automated data collection schemes and advances in sensor technology are increasing the amount of data that can be monitored in real time. However, given the high annotation costs and the time required by quality inspections, data is often available only in unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature, but most of the focus has been on the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are offered to the learner sequentially and the learner must decide immediately whether to perform the quality check to obtain the label or to discard the instance. The approach is inspired by optimal experimental design theory, and the iterative nature of the decision-making process is handled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
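The thresholding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the informativeness measure or how the threshold is chosen, so this sketch assumes a linear model with a ridge term, uses the scaled prediction variance x^T (X^T X + ridge I)^{-1} x as a stand-in criterion borrowed from optimal experimental design, and takes a fixed, user-supplied threshold. The names `StreamQuerySelector`, `run_stream`, `threshold`, and `ridge` are hypothetical.

```python
import numpy as np


class StreamQuerySelector:
    """Threshold-based label selection for a linear model (illustrative only)."""

    def __init__(self, n_features, threshold, ridge=1e-3):
        # Assumption: a fixed threshold; the paper may set it differently.
        self.threshold = threshold
        # Inverse of the regularized information matrix (X^T X + ridge * I),
        # which equals I / ridge before any label has been collected.
        self.cov = np.eye(n_features) / ridge
        # Running sum of y_i * x_i over the labeled instances.
        self.xty = np.zeros(n_features)

    def informativeness(self, x):
        # Scaled prediction variance at x: large when x lies in a direction
        # that the labeled data has not yet covered.
        return float(x @ self.cov @ x)

    def should_query(self, x):
        # Request the label only if the instance is informative enough.
        return self.informativeness(x) > self.threshold

    def update(self, x, y):
        # Sherman-Morrison rank-one update of the inverse information matrix.
        cx = self.cov @ x
        self.cov -= np.outer(cx, cx) / (1.0 + x @ cx)
        self.xty += y * x

    @property
    def coef(self):
        # Ridge regression estimate from the labeled instances so far.
        return self.cov @ self.xty


def run_stream(X, y, threshold, ridge=1e-3):
    """Process instances one at a time; y[i] is read only when queried."""
    selector = StreamQuerySelector(X.shape[1], threshold, ridge)
    queried = []
    for i, x in enumerate(X):
        if selector.should_query(x):
            selector.update(x, y[i])  # stands in for the quality inspection
            queried.append(i)
    return selector, queried
```

Calling `run_stream(X, y, threshold=...)` on an array of streamed feature vectors returns the fitted selector and the indices that were sent for inspection. In this sketch a higher threshold means fewer inspections, at the cost of a slower reduction in the uncertainty of the estimated coefficients.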