Compressive learning is an emerging approach that drastically reduces the memory footprint of large-scale learning by first summarizing a large dataset into a low-dimensional sketch vector and then decoding from this sketch the latent information needed for learning. In light of recent progress on information preservation guarantees for sketches based on random features, a major objective is to design easy-to-tune algorithms (called decoders) that extract this information robustly and efficiently. Various heuristics have been proposed to address the underlying non-convex optimization problems. In the case of compressive clustering, the standard heuristic is CL-OMPR, a variant of sliding Frank-Wolfe. Yet CL-OMPR is hard to tune, and its robustness has so far been overlooked. In this work, we closely examine CL-OMPR in order to circumvent its limitations. In particular, we show how this algorithm can fail to recover the clusters even in advantageous scenarios. To gain insight, we show how these deficiencies can be attributed to optimization difficulties related to the structure of a correlation function appearing at core steps of the algorithm. To address these limitations, we propose an alternative decoder offering substantial improvements over CL-OMPR. Its design is notably inspired by the mean shift algorithm, a classic approach to detecting the local maxima of kernel density estimators. The proposed algorithm can extract clustering information from a sketch of the MNIST dataset that is 10 times smaller than before.
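To make the two ingredients named above concrete, the following minimal sketch (not the decoder proposed in this work) illustrates them on toy data. It assumes a sketch defined as the empirical average of random Fourier features, and it runs the classic Gaussian mean shift iteration directly on the data points rather than on the sketch. The function names `compute_sketch` and `mean_shift`, the frequency matrix `W`, the bandwidth, and the toy dataset are all illustrative assumptions.

```python
import numpy as np

def compute_sketch(X, W):
    """Sketch of X (n x d): average of exp(i * x^T w_j) over the n points.

    W is a d x m matrix of random frequencies (an assumed choice of random
    features). The result has m entries regardless of n.
    """
    return np.exp(1j * X @ W).mean(axis=0)

def mean_shift(X, start, bandwidth, n_iter=100, tol=1e-8):
    """Classic Gaussian mean shift on raw data: climb from `start` towards a
    local maximum of the kernel density estimator built on X."""
    c = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        # Gaussian kernel weight of every data point relative to the current c.
        w = np.exp(-np.sum((X - c) ** 2, axis=1) / (2 * bandwidth ** 2))
        c_new = w @ X / w.sum()  # weighted mean of the data
        if np.linalg.norm(c_new - c) < tol:
            return c_new
        c = c_new
    return c

rng = np.random.default_rng(0)
# Toy data (assumption): two well-separated Gaussian clusters in the plane.
X = np.vstack([rng.normal(-3.0, 0.5, size=(500, 2)),
               rng.normal(+3.0, 0.5, size=(500, 2))])
W = rng.normal(size=(2, 20))  # d = 2, m = 20 random frequencies
z = compute_sketch(X, W)      # complex vector with 20 entries
mode = mean_shift(X, start=[2.0, 2.0], bandwidth=1.0)
```

Here `z` has a fixed size of 20 entries however many points are summarized, which is the source of the memory savings, and `mode` lands near the center of the cluster closest to the starting point. A compressive decoder has access only to `z`, not to `X`, so this example shows the building blocks rather than the decoding procedure itself.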