Forecasting could be negatively affected by the anonymization requirements in data protection legislation. To measure the potential severity of this problem, we derive theoretical bounds on the forecast loss incurred when additive exponential smoothing models use protected data. Following the anonymization guidelines of the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), we develop the $k$-nearest Time Series ($k$-nTS) Swapping and $k$-means Time Series ($k$-mTS) Shuffling methods, which create protected time series data that minimize the loss to forecasts while preventing a data intruder from detecting privacy issues. For efficient and effective decision making, we formulate the simultaneous data swapping within each cluster as an integer programming problem that seeks a perfect matching. We call this a two-party data privacy framework because the optimization model includes the utilities of both a data provider and a data intruder. We apply our data protection methods to thousands of time series and find that they maintain the forecasts and patterns (level, trend, and seasonality) of the time series well compared to the standard data protection methods suggested in legislation. Substantively, our paper addresses the challenge of protecting time series data that are used for forecasting. Our findings suggest the managerial importance of incorporating the concerns of forecasters into data protection itself.