Publishing and sharing data is crucial for the data mining community: it enables collaboration and drives open innovation. However, many researchers cannot release their data because of privacy regulations or the risk of leaking confidential business information. To alleviate these issues, we propose Time Series Synthesis Using the Matrix Profile (TSSUMP), a method in which synthesized time series are released in lieu of the original data. TSSUMP synthesizes a time series that preserves the similarity join information (i.e., the Matrix Profile) of the original while reducing the correlation between the synthesized and the original time series. As a result, neither the values at individual time steps nor the local patterns (or shapes) of the original data can be recovered, yet the resulting data can still be used for the downstream tasks that data analysts are interested in. We concentrate on similarity joins because they are among the most widely applied routines across time series data mining tasks. We test our method on an ECG case study of gender masking and prediction. In this case study, the gender information is removed from the synthesized time series, while enough information from the original time series is preserved that unmodified data mining tools obtain near-identical performance on the synthesized and the original time series.
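To make the two quantities named in the abstract concrete, the following is a minimal sketch of how one might measure them for any candidate synthetic series: a naive self-join Matrix Profile (z-normalized Euclidean distance to the nearest non-trivial neighbor) and the Pearson correlation with the original. It is not the TSSUMP procedure itself. The function names, the window length `m`, the exclusion zone of `m // 2`, and the toy data are all assumptions made for illustration, and time reversal is used only as a stand-in candidate because it mirrors the Matrix Profile exactly.

```python
import math
import random


def znorm(window):
    """Z-normalize a subsequence; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / n)
    if std < 1e-12:
        return [0.0] * n
    return [(v - mean) / std for v in window]


def matrix_profile(series, m):
    """Naive O(n^2 * m) self-join: for each length-m subsequence, the
    z-normalized Euclidean distance to its nearest non-trivial neighbor."""
    subs = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    excl = max(1, m // 2)  # assumed exclusion zone to skip trivial matches
    profile = []
    for i, a in enumerate(subs):
        best = math.inf
        for j, b in enumerate(subs):
            if abs(i - j) < excl:
                continue
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            best = min(best, dist)
        profile.append(best)
    return profile


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


# Toy data (hypothetical): a noisy sine wave standing in for a real recording.
random.seed(0)
original = [math.sin(0.3 * t) + random.gauss(0, 0.2) for t in range(60)]

# Stand-in candidate: the time-reversed series. NOT the TSSUMP output.
candidate = original[::-1]

m = 8  # assumed subsequence length
mp_original = matrix_profile(original, m)
mp_candidate = matrix_profile(candidate, m)

# Reversal mirrors the profile, so compare against the reversed candidate profile.
max_gap = max(abs(a - b) for a, b in zip(mp_original, reversed(mp_candidate)))
corr = pearson(original, candidate)

print(f"largest Matrix Profile gap (mirrored): {max_gap:.2e}")
print(f"Pearson correlation with original:     {corr:.3f}")
```

The sketch only shows how the two quantities could be checked side by side; it makes no claim that the toy candidate has low correlation with the original, and a simple reversal would of course offer no privacy protection. For series of realistic length, the quadratic loop would normally be replaced by an optimized Matrix Profile implementation.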