Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success compared to autoregressive and conventional statistical approaches. Despite their empirical success, the theoretical understanding of how well diffusion-based models capture complex spatial and temporal dependencies between the missing values and observed ones remains limited. Our work addresses this gap by investigating the statistical efficiency of conditional diffusion transformers for imputation and quantifying the uncertainty in missing values. Specifically, we derive statistical sample complexity bounds based on a novel approximation theory for conditional score functions using transformers, and, through this, construct tight confidence regions for missing values. Our findings also reveal that the efficiency and accuracy of imputation are significantly influenced by the missing patterns. Furthermore, we validate these theoretical insights through simulation and propose a mixed-masking training strategy to enhance the imputation performance.