We describe a measure quantization procedure, i.e., an algorithm that finds the best approximation of a target probability law (and, more generally, of a signed finite-variation measure) by a sum of $Q$ Dirac masses, where $Q$ is the quantization parameter. The procedure minimizes the statistical distance between the original measure and its quantized version; the distance is built from a negative definite kernel and, if necessary, can be computed on the fly and fed to a stochastic optimization algorithm (such as SGD or Adam). We investigate theoretically the fundamental question of the existence of an optimal measure quantizer and identify the kernel properties required to guarantee suitable behavior. We propose two best linear unbiased (BLUE) estimators for the squared statistical distance and use them in an unbiased procedure, called HEMQ, to find the optimal quantization. We test HEMQ on several databases: multi-dimensional Gaussian mixtures, Wiener space cubature, Italian wine cultivars and the MNIST image database. The results indicate that the HEMQ algorithm is robust and versatile and that, for the class of Huber-energy kernels, it matches the expected intuitive behavior.
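To make the idea concrete, the following is a minimal NumPy sketch of kernel-based quantization by stochastic gradient descent. It is not the HEMQ procedure itself and does not use the BLUE estimators: it uses the plain plug-in estimate of the squared distance, uniform weights $1/Q$ on the Dirac masses, and an assumed Huber-type kernel $h(u,v)=\sqrt{a^2+|u-v|^2}-a$. The function names, the kernel form, and all parameter values are illustrative assumptions.

```python
import numpy as np


def huber_kernel(u, v, a=1e-3):
    """Assumed kernel h(u, v) = sqrt(a^2 + |u - v|^2) - a for all pairs."""
    diff = u[:, None, :] - v[None, :, :]
    return np.sqrt(a**2 + np.sum(diff**2, axis=-1)) - a


def squared_distance(x, y, a=1e-3):
    """Plug-in squared distance between the uniform empirical measures
    supported on the rows of x and on the rows of y."""
    return (2.0 * huber_kernel(x, y, a).mean()
            - huber_kernel(x, x, a).mean()
            - huber_kernel(y, y, a).mean())


def gradient(x, y, a=1e-3):
    """Gradient of squared_distance(x, y) with respect to the points y."""
    n, q = len(x), len(y)
    dxy = y[:, None, :] - x[None, :, :]            # shape (q, n, d)
    rxy = np.sqrt(a**2 + np.sum(dxy**2, axis=-1))  # shape (q, n)
    dyy = y[:, None, :] - y[None, :, :]            # shape (q, q, d)
    ryy = np.sqrt(a**2 + np.sum(dyy**2, axis=-1))  # shape (q, q)
    # Cross term pulls the quantization points toward the samples.
    attract = (dxy / rxy[..., None]).sum(axis=1) * 2.0 / (n * q)
    # Self-interaction term pushes the quantization points apart.
    repel = (dyy / ryy[..., None]).sum(axis=1) * 2.0 / q**2
    return attract - repel


def quantize(x, q, steps=500, lr=0.2, batch=64, seed=0):
    """Plain minibatch SGD on the positions of q Dirac masses (illustrative)."""
    rng = np.random.default_rng(seed)
    y = x[rng.choice(len(x), size=q, replace=False)].copy()
    b = min(batch, len(x))
    for _ in range(steps):
        xb = x[rng.choice(len(x), size=b, replace=False)]  # fresh minibatch
        y -= lr * gradient(xb, y)
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    samples = rng.normal(size=(500, 2))  # stand-in for the target law
    points = quantize(samples, q=5)
    print(points.shape, squared_distance(samples, points))
```

Because only the cross term depends on the samples, drawing a fresh minibatch at every step gives an unbiased estimate of the gradient of this plug-in objective; an optimizer such as Adam could replace the plain update rule.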