The proliferation of automated data collection schemes and advances in sensor technology are increasing the amount of data that can be monitored in real time. However, because annotation is costly and quality inspections take time, data are often available only in unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature, but most of the attention has gone to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are offered to the learner sequentially and the learner must decide immediately whether to perform the quality check to obtain the label or to discard the instance. The approach is inspired by optimal experimental design theory, and the sequential nature of the decision-making process is handled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm yields a faster reduction in the prediction error.
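The thresholding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a ridge-regularized linear model, a fixed threshold chosen by hand, and a leverage-style score $x^\top A^{-1} x$ with $A = X^\top X + \lambda I$ as the measure of informativeness. That score is linked to D-optimal design through the identity $\det(A + x x^\top) = \det(A)\,(1 + x^\top A^{-1} x)$, so a large score means a large gain in the determinant of the information matrix. The names `StreamingDesignSampler` and `run_stream` are hypothetical.

```python
import numpy as np


class StreamingDesignSampler:
    """Hypothetical stream-based query rule for linear regression.

    Assumption: informativeness is the leverage-style score x^T A^{-1} x,
    where A = X^T X + ridge * I is built from the points labeled so far.
    """

    def __init__(self, n_features, threshold, ridge=1.0):
        # Inverse of the regularized information matrix (starts at I / ridge).
        self.A_inv = np.eye(n_features) / ridge
        # Running sum of y * x over labeled points (X^T y).
        self.b = np.zeros(n_features)
        self.threshold = threshold

    def informativeness(self, x):
        # Proportional to the predictive variance of the linear model at x.
        return float(x @ self.A_inv @ x)

    def should_query(self, x):
        # Request the label only when the score exceeds the threshold.
        return self.informativeness(x) > self.threshold

    def update(self, x, y):
        # Sherman-Morrison rank-one update of A^{-1} after adding x x^T to A.
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += y * x

    @property
    def coef(self):
        # Ridge estimate computed from the labeled points only.
        return self.A_inv @ self.b


def run_stream(X, y, threshold, ridge=1.0):
    """Process instances one at a time; return the sampler and queried indices."""
    sampler = StreamingDesignSampler(X.shape[1], threshold, ridge)
    queried = []
    for i, x in enumerate(X):
        if sampler.should_query(x):
            # The label y[i] is only looked at once the instance is selected.
            sampler.update(x, y[i])
            queried.append(i)
    return sampler, queried
```

Calling `run_stream(X, y, threshold=0.1)` on a feature matrix `X` and label vector `y` returns the fitted sampler and the indices whose labels were requested. Each accepted point lowers the score of similar points, so the query rate drops as the design fills in. A fixed threshold is the simplest possible choice here; how the threshold should actually be set is left open by this sketch.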