In learning theory, a standard assumption is that the data are generated from a finite mixture model. But what happens when the number of components is not known in advance? The problem of estimating the number of components, also called model selection, is important in its own right, but there are essentially no known efficient algorithms with provable guarantees, let alone ones that can tolerate adversarial corruptions. In this work, we study the problem of robust model selection for univariate Gaussian mixture models (GMMs). Given $\textsf{poly}(k/\epsilon)$ samples from a distribution that is $\epsilon$-close in total variation (TV) distance to a GMM with $k$ components, we can construct, in $\textsf{poly}(k/\epsilon)$ time, a GMM with $\widetilde{O}(k)$ components that approximates the distribution to within $\widetilde{O}(\epsilon)$. Thus, we can determine the minimum number of components needed to fit the distribution up to a logarithmic factor. Prior to our work, the only known algorithms for learning arbitrary univariate GMMs either output significantly more than $k$ components (e.g., $k/\epsilon^2$ components for kernel density estimates) or run in time exponential in $k$. Moreover, by adapting our techniques, we obtain similar results for reconstructing Fourier-sparse signals.
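To make the corruption model concrete, the following is a minimal sketch of what "$\epsilon$-close in TV distance to a GMM" can look like numerically. It is not the algorithm of this work. The helper names (`gmm_pdf`, `tv_distance`), the two-component mixture, the uniform noise on $[5, 6]$, and the value of `eps` are all hypothetical choices made for illustration. The TV distance is approximated by a Riemann sum over a fixed grid.

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """Density of a univariate GMM evaluated at the 1-D array of points x."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    w = np.asarray(weights, dtype=float)             # shape (k,)
    mu = np.asarray(means, dtype=float)              # shape (k,)
    sd = np.asarray(stds, dtype=float)               # shape (k,)
    comp = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return comp @ w                                  # shape (n,)

def tv_distance(pdf_p, pdf_q, lo, hi, n_grid=200_001):
    """Approximate 0.5 * integral of |p - q| over [lo, hi] by a Riemann sum."""
    grid = np.linspace(lo, hi, n_grid)
    dx = grid[1] - grid[0]
    return 0.5 * np.sum(np.abs(pdf_p(grid) - pdf_q(grid))) * dx

# Hypothetical clean model: a GMM with k = 2 components.
def clean(x):
    return gmm_pdf(x, weights=[0.5, 0.5], means=[-2.0, 2.0], stds=[1.0, 0.5])

# Hypothetical corruption: an eps fraction of the mass is replaced by
# uniform noise on [5, 6], far from both components.
eps = 0.05

def noise(x):
    x = np.asarray(x, dtype=float)
    return ((x >= 5.0) & (x <= 6.0)).astype(float)

def corrupted(x):
    return (1.0 - eps) * clean(x) + eps * noise(x)

tv = tv_distance(clean, corrupted, lo=-15.0, hi=15.0)
print(f"approximate TV distance: {tv:.4f} (eps = {eps})")
```

Because the corrupted density is a mixture of the clean GMM and an arbitrary noise distribution with weight `eps`, its TV distance to the clean GMM is at most `eps`. This is one simple instance of the input distributions the abstract allows.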