Forecasting competitions are increasingly important as a means of learning best practices and gaining knowledge. Data leakage is one of the most common problems in such competitions. A data leak occurs when the training data contain information about the test data. With time series data, leaks can arise in several ways, for example: i) randomly chosen blocks of time series are concatenated to form a new time series; ii) a time series is rescaled or shifted; iii) patterns are repeated within a time series; iv) white noise is added to the original time series to form a new time series. This work introduces a novel tool to detect these data leaks. The tsdataleaks package provides a simple and computationally efficient algorithm to detect data leaks in time series data. This paper demonstrates the package design and its ability to detect data leakage with an application to forecasting competition data.
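To make the idea of a leak concrete, the following is a minimal Python sketch of one way such a leak could be detected: slide the final observations of a test segment across a training series and flag windows whose Pearson correlation with it is close to 1. This is an illustrative assumption, not the tsdataleaks API (the package itself is written in R); the function name `find_leak_windows` and the `cutoff` parameter are hypothetical.

```python
import numpy as np

def find_leak_windows(query, series, cutoff=0.99):
    """Return (start_index, correlation) pairs for every window of `series`
    whose Pearson correlation with `query` is at least `cutoff`.

    Hypothetical helper for illustration; not part of tsdataleaks.
    """
    query = np.asarray(query, dtype=float)
    series = np.asarray(series, dtype=float)
    h = len(query)
    if h < 2 or len(series) < h:
        return []
    # All contiguous windows of length h, shape (len(series) - h + 1, h).
    windows = np.lib.stride_tricks.sliding_window_view(series, h)
    # Centre the query and each window, then take normalised dot products.
    q = query - query.mean()
    w = windows - windows.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(q) * np.linalg.norm(w, axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Constant windows have zero variance; treat their correlation as 0.
        corr = np.where(denom > 0, (w @ q) / denom, 0.0)
    hits = np.flatnonzero(corr >= cutoff)
    return [(int(i), float(corr[i])) for i in hits]

# Simulated example: the "test" segment is a rescaled and shifted copy
# of observations 50..59 of the training series (case ii above).
rng = np.random.default_rng(0)
source = rng.normal(size=200).cumsum()
leaked_test = 3.0 * source[50:60] + 10.0
hits = find_leak_windows(leaked_test, source, cutoff=0.999)
print(hits)  # the window starting at index 50 is among the hits
```

Because Pearson correlation is invariant to positive rescaling and shifting, this sketch catches case ii) directly. Leaks with added white noise (case iv) would need a lower `cutoff`, at the price of more false positives.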