In real-world applications, machine learning models often become obsolete because the joint distribution of the data shifts with underlying temporal trends, a phenomenon known as concept drift. Existing works propose model-specific strategies to achieve temporal generalization in the near-future domain. However, the diverse characteristics of real-world datasets call for customized prediction model architectures, so model-specific strategies do not transfer readily. There is therefore an urgent demand for a model-agnostic temporal domain generalization approach that remains general across diverse data modalities and architectures. In this work, we address the concept drift problem from a data-centric perspective, which avoids having to account for the interaction between data and model. Developing such a framework presents non-trivial challenges: (i) existing generative models struggle to generate out-of-distribution future data, and (ii) precisely capturing the temporal trends of the joint distribution across chronologically ordered source domains is computationally infeasible. To address these challenges, we propose the COncept Drift simulAtor (CODA) framework, which incorporates a predicted feature correlation matrix to simulate future data for model training. Specifically, CODA uses feature correlations to represent data characteristics at specific time points, thereby avoiding the prohibitive computational cost. Experimental results demonstrate that training on CODA-generated data effectively achieves temporal domain generalization across different model architectures.
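The idea of representing each time point by its feature correlations and simulating future data from a predicted correlation matrix can be illustrated with a toy sketch. This is not the CODA implementation: the per-entry linear extrapolation, the eigenvalue-clipping repair step, and the Gaussian sampler are stand-in assumptions chosen for brevity, and all function names (`correlation_per_domain`, `extrapolate_correlation`, `nearest_correlation`, `simulate_future`) are hypothetical.

```python
import numpy as np


def correlation_per_domain(domains):
    """Return one feature correlation matrix per chronological source domain."""
    return np.stack([np.corrcoef(x, rowvar=False) for x in domains])


def nearest_correlation(mat, eps=1e-6):
    """Repair a matrix into a valid correlation matrix (assumed heuristic):
    symmetrize, clip eigenvalues to be positive, rescale to unit diagonal."""
    sym = (mat + mat.T) / 2
    vals, vecs = np.linalg.eigh(sym)
    psd = (vecs * np.clip(vals, eps, None)) @ vecs.T
    scale = np.sqrt(np.diag(psd))
    return psd / np.outer(scale, scale)


def extrapolate_correlation(corrs):
    """Assumed predictor: fit a line to each matrix entry over time and
    evaluate it one step past the last observed domain."""
    n_domains = len(corrs)
    t = np.arange(n_domains)
    flat = corrs.reshape(n_domains, -1)            # shape (T, d*d)
    slope, intercept = np.polyfit(t, flat, deg=1)  # each of shape (d*d,)
    pred = (slope * n_domains + intercept).reshape(corrs.shape[1:])
    return nearest_correlation(pred)


def simulate_future(corr, n_samples, rng):
    """Assumed generator: zero-mean Gaussian with the predicted correlation."""
    return rng.multivariate_normal(np.zeros(len(corr)), corr, size=n_samples)


# Toy source domains: the correlation between two features grows over time.
rng = np.random.default_rng(0)
domains = [
    rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=2000)
    for rho in (0.1, 0.3, 0.5, 0.7)
]

source_corrs = correlation_per_domain(domains)
future_corr = extrapolate_correlation(source_corrs)
synthetic = simulate_future(future_corr, n_samples=2000, rng=rng)
```

In this toy setting the extrapolated off-diagonal entry continues the upward trend seen in the source domains, and `synthetic` could then serve as training input for any downstream model, which is the model-agnostic point the abstract makes. A learned predictor and a stronger generative model would replace the two stand-ins in practice.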