CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting

Dataset condensation is a recently introduced technique that generates a small synthetic dataset for training deep neural networks at lower cost. Its objective is to ensure that a model trained on the synthetic dataset performs comparably to a model trained on the full dataset. However, existing methods concentrate predominantly on classification tasks, which makes them hard to adapt to time series forecasting (TS-forecasting). The difficulty arises from a disparity in how synthetic data is evaluated. In classification, the synthetic data is considered well-distilled if the model trained on the full dataset and the model trained on the synthetic dataset yield identical labels for the same input, regardless of differences in the distribution of their output logits. In TS-forecasting, by contrast, the quality of the distilled data is determined by the distance between the predictions of the two models. The synthetic data is deemed well-distilled only when all data points within the predictions are similar. Consequently, TS-forecasting imposes a more rigorous evaluation criterion than classification. To bridge this gap, we theoretically analyze the optimization objective of dataset condensation for TS-forecasting and, based on this analysis, propose a new one-line plugin for dataset condensation named Dataset Condensation for Time Series Forecasting (CondTSF). Plugging CondTSF into previous dataset condensation methods reduces the distance between the predictions of the model trained on the full dataset and the model trained on the synthetic dataset, thereby enhancing performance. We conduct extensive experiments on eight commonly used time series datasets. CondTSF consistently improves the performance of all previous dataset condensation methods across all datasets, particularly at low condensing ratios.
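
The disparity between the two evaluation criteria can be made concrete with a small sketch. The code below is only an illustration of the contrast described in the abstract, not the CondTSF plugin itself. The helper names (`labels_agree`, `prediction_mse`) and all numeric values are hypothetical, and mean squared error is assumed as the distance because the abstract does not name a specific one.

```python
# Illustrative sketch only: contrasts the two evaluation criteria from the
# abstract. This is NOT the CondTSF plugin; names and numbers are hypothetical.

def argmax(values):
    """Index of the largest entry in a list of numbers."""
    return max(range(len(values)), key=lambda i: values[i])

def labels_agree(logits_full, logits_syn):
    """Classification criterion: only the predicted label has to match."""
    return argmax(logits_full) == argmax(logits_syn)

def prediction_mse(pred_full, pred_syn):
    """Forecasting criterion (MSE assumed): every predicted point contributes."""
    squared = [(a - b) ** 2 for a, b in zip(pred_full, pred_syn)]
    return sum(squared) / len(squared)

# Classification: the logits differ a lot, yet both models pick class 0,
# so the synthetic data would count as well-distilled.
logits_full = [2.0, 0.5, -1.0]   # model trained on the full dataset
logits_syn = [0.9, 0.8, 0.1]     # model trained on the synthetic dataset
same_label = labels_agree(logits_full, logits_syn)

# Forecasting: three of four points match exactly, but the single mismatched
# point still produces a non-zero distance between the two predictions.
pred_full = [1.0, 2.0, 3.0, 4.0]
pred_syn = [1.0, 2.0, 3.0, 6.0]
gap = prediction_mse(pred_full, pred_syn)

print(same_label, gap)
```

In this toy case the classification check passes even though the logits are far apart, while the forecasting distance is non-zero because of one divergent point, which is the sense in which the abstract calls the forecasting criterion more rigorous.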

