A common implicit assumption of most clustering methods is that the training data and future data are drawn from the same distribution. However, this assumption may not hold in most real-world scenarios. In this paper, we propose an information-theoretic importance-sampling-based approach for clustering problems (ITISC), which minimizes the worst case of expected distortions under a constraint on distribution deviation. The distribution deviation constraint can be converted to a constraint over a set of weight distributions centered on the uniform distribution derived from importance sampling. The objective of the proposed approach is to minimize the loss under maximum degradation; hence, the resulting problem is a constrained minimax optimization problem, which can be reformulated as an unconstrained problem using the Lagrange method. The optimization problem can be solved either by an alternating optimization algorithm or by a general-purpose optimization routine in commercially available software. Experimental results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed model. Furthermore, we show that fuzzy c-means is a special case of ITISC with the logarithmic distortion, and this observation provides an interesting physical interpretation for the fuzzy exponent $m$.
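For readers unfamiliar with the fuzzy exponent $m$ mentioned in the abstract, the following is a minimal sketch of the standard fuzzy c-means alternating updates. It is not the ITISC algorithm itself, whose update rules are not given in the abstract; the function name `fuzzy_c_means` and its parameters (`n_clusters`, `n_iter`, `tol`, `seed`) are illustrative assumptions, and squared Euclidean distance is assumed as the distortion.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means (illustrative sketch, not the ITISC method).

    X is an (n_samples, n_features) array; m > 1 is the fuzzy exponent.
    Returns (centers, U), where U[i, j] is the membership of sample i
    in cluster j and each row of U sums to 1.
    """
    if m <= 1.0:
        raise ValueError("fuzzy exponent m must be greater than 1")
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]

    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((n_samples, n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Center update: mean of the data weighted by U ** m.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]

        # Squared Euclidean distance from every sample to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero

        # Membership update: U[i, j] proportional to d2[i, j] ** (-1 / (m - 1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)

        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break

    return centers, U
```

In this sketch, values of `m` close to 1 push the memberships toward hard (k-means-like) assignments, while larger values make them more uniform across clusters.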