An assumption shared by most clustering methods is that the training data and future data are drawn from the same distribution. However, this assumption may not hold in some real-world scenarios. In this paper, we propose an importance sampling based deterministic annealing approach (ISDA) for clustering problems, which minimizes the worst case of the expected distortion under a constraint on distribution deviation. The distribution deviation constraint can be converted to a constraint over a set of weight distributions, derived from importance sampling, that is centered on the uniform distribution. The objective of the proposed approach is to minimize the loss under maximum degradation; the resulting problem is therefore a constrained minimax optimization problem, which can be reformulated as an unconstrained problem using the Lagrange method and solved by a quasi-Newton algorithm. Experimental results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed ISDA. Furthermore, we show that fuzzy c-means is a special case of ISDA with the logarithmic distortion. This observation sheds new light on the relationship between fuzzy c-means and deterministic annealing clustering algorithms and provides an interesting physical and information-theoretical interpretation of the fuzzy exponent $m$.
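The stated link between fuzzy c-means and deterministic annealing can be illustrated numerically. The sketch below is not the ISDA algorithm itself. It only compares the standard fuzzy c-means membership formula with the Gibbs (soft) assignment used in classical deterministic annealing, under the assumption that the distortion is the logarithm of the squared Euclidean distance and the temperature is set to $T = m - 1$. The function names, the random data, and the choice $m = 2$ are illustrative assumptions, and the exact correspondence derived in the paper may be parameterized differently.

```python
import numpy as np

def fcm_memberships(X, centers, m):
    """Standard fuzzy c-means memberships for fixed centers and fuzzy exponent m > 1."""
    # Squared Euclidean distances, shape (n_points, n_clusters).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def gibbs_assignments(distortion, T):
    """Soft assignments p(k | x) proportional to exp(-distortion / T)."""
    logits = -distortion / T
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Illustrative data (assumption): 200 random 2-D points and 3 random centers.
# No point coincides with a center, so the logarithm below is finite.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centers = rng.normal(size=(3, 2))
m = 2.0

d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

u_fcm = fcm_memberships(X, centers, m)
p_log = gibbs_assignments(np.log(d2), T=m - 1.0)  # logarithmic distortion
p_sq = gibbs_assignments(d2, T=m - 1.0)           # squared-error distortion

print("FCM matches Gibbs with log distortion:    ", np.allclose(u_fcm, p_log))
print("FCM matches Gibbs with squared distortion:", np.allclose(u_fcm, p_sq))
```

The agreement follows from $\exp(-\log d^2 / T) = (d^2)^{-1/T}$, which is the unnormalized fuzzy c-means membership when $T = m - 1$. In this simplified setting the fuzzy exponent therefore plays the role of a temperature, which is one way to read the physical interpretation mentioned above.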