This paper addresses the limitations of conventional vector quantization algorithms, particularly K-Means and its variant K-Means++, and investigates the Stochastic Quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning tasks. Traditional clustering algorithms use memory inefficiently, requiring all data samples to be loaded into memory at once, which becomes impractical for large-scale datasets. Variants such as Mini-Batch K-Means partially mitigate this issue by reducing memory usage, but they lack robust theoretical convergence guarantees because of the non-convex nature of clustering problems. In contrast, the Stochastic Quantization algorithm provides strong theoretical convergence guarantees, making it a reliable alternative for large-scale clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data, comparing model accuracy across various ratios of labeled to unlabeled data. To address the challenge of high dimensionality, we employ a Triplet Network to encode images into low-dimensional representations in a latent space, which serve as the basis for comparing the efficiency of the Stochastic Quantization algorithm against traditional quantization algorithms. Furthermore, we improve the algorithm's convergence speed by introducing a modification with an adaptive learning rate.
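The memory advantage described above comes from updating quantization centers one sample at a time rather than holding the full dataset's assignments in memory. The sketch below illustrates this idea with an online, SGD-style quantizer using a count-based adaptive step size; it is a minimal illustration of the general technique, not the paper's exact algorithm, and the function name and parameters are our own for this example.

```python
import numpy as np

def stochastic_quantization(X, k, epochs=10, seed=0):
    """Sketch of online vector quantization with SGD-style updates.

    Each step draws one sample, finds its nearest center, and moves
    that center toward the sample with a per-center adaptive step
    size 1/n_j (where n_j counts updates to center j). Memory stays
    O(k * d) regardless of how samples are streamed in.
    Illustrative only -- not the paper's exact SQ algorithm.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize centers from k distinct random samples.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = X[i]
            # Nearest center under squared Euclidean distance.
            j = np.argmin(((centers - x) ** 2).sum(axis=1))
            counts[j] += 1
            lr = 1.0 / counts[j]  # adaptive (decaying) learning rate
            centers[j] += lr * (x - centers[j])
    return centers

# Usage: quantize 2-D points drawn around three means into 3 centers.
X = np.vstack([np.random.default_rng(1).normal(m, 0.1, size=(50, 2))
               for m in (0.0, 1.0, 2.0)])
C = stochastic_quantization(X, k=3)
```

Because each update touches a single sample and a single center, the loop can consume data as a stream, which is the property that makes this family of methods attractive for large-scale datasets.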