Masked autoencoders (MAEs) have recently demonstrated effectiveness in tabular data imputation. However, due to the inherent heterogeneity of tabular data, the uniform random masking strategy commonly used in MAEs can disrupt the distribution of missingness, leading to suboptimal performance. To address this, we propose a proportional masking strategy for MAEs. Specifically, we first compute missingness statistics from the observed proportions in the dataset, and then generate masks that align with these statistics, ensuring that the distribution of missingness is preserved after masking. Furthermore, we argue that simple MLP-based token mixing offers competitive, and often superior, performance compared to attention mechanisms while being more computationally efficient, especially in the tabular domain given its inherent heterogeneity. Experimental results validate the effectiveness of the proposed proportional masking strategy across various missing data patterns in tabular datasets. Code is available at: \url{https://github.com/normal-kim/PMAE}.
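The proportional masking idea can be illustrated with a minimal sketch: estimate each column's observed missingness, then allocate the masking budget across columns in proportion to those statistics rather than uniformly. This is a hypothetical illustration of the concept, not the authors' PMAE implementation; the function name, the `overall_rate` parameter, and the allocation rule are all assumptions made here for clarity.

```python
import numpy as np

def proportional_mask(X, overall_rate=0.3, rng=None):
    """Sample a training mask whose per-column masking rates follow the
    dataset's observed missingness proportions (illustrative sketch).

    X            : 2-D float array with np.nan marking originally missing entries.
    overall_rate : target average fraction of entries to mask across columns.
    """
    rng = np.random.default_rng() if rng is None else rng
    observed = ~np.isnan(X)                      # only observed entries can be masked
    miss_frac = 1.0 - observed.mean(axis=0)      # per-column missingness statistic
    # Distribute the masking budget in proportion to each column's
    # missingness; fall back to uniform weights if X is fully observed.
    weights = miss_frac if miss_frac.sum() > 0 else np.ones(X.shape[1])
    col_rates = overall_rate * weights * X.shape[1] / weights.sum()
    col_rates = np.clip(col_rates, 0.0, 1.0)     # mean rate ≈ overall_rate before clipping
    # Bernoulli-sample per entry with its column's rate, never masking NaNs.
    return (rng.random(X.shape) < col_rates) & observed
```

The mask could then mark additional "pseudo-missing" entries for the MAE to reconstruct, so that the combined (original + masked) missingness pattern mirrors the one observed in the data; a uniform random mask would instead flatten those per-column differences.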