Existing distribution compression methods reduce the number of observations in a dataset by minimising the Maximum Mean Discrepancy (MMD) between the original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall time and memory complexity linear in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which we introduce to quantify the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that BDC achieves downstream task performance comparable or superior to ambient-space compression, at substantially lower cost and with significantly higher compression rates.
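The paper-specific DMMD, RMMD, and EMMD objectives are all built on the standard MMD between empirical distributions. As a minimal illustrative sketch (not the paper's implementation — the function names, the RBF kernel choice, and the lengthscale are assumptions for illustration), a biased squared-MMD estimator can be computed as follows; a compressed set is one that drives this quantity toward zero against the original data:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * lengthscale**2))

def mmd2(X, Y, lengthscale=1.0):
    # Biased (V-statistic) estimate of squared MMD between the empirical
    # distributions of X (n x d) and Y (m x d); zero iff the two empirical
    # kernel mean embeddings coincide.
    Kxx = rbf_kernel(X, X, lengthscale)
    Kyy = rbf_kernel(Y, Y, lengthscale)
    Kxy = rbf_kernel(X, Y, lengthscale)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))        # original dataset
Y_same = rng.normal(0.0, 1.0, size=(50, 2))    # small set, same distribution
Y_shifted = rng.normal(3.0, 1.0, size=(50, 2)) # small set, shifted distribution

print(mmd2(X, Y_same))     # small: the 50 points summarise X well
print(mmd2(X, Y_shifted))  # large: the distributions differ
```

A compressed set of 50 points matching the source distribution yields a much smaller MMD than a mismatched one, which is the signal that compression methods (including BDC's latent-space variants) optimise by gradient descent on the compressed points.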