Data compression for fast dimension reduction and clustering of high-dimensional discrete data

High-dimensional discrete data arise in many contemporary applications, including genomics, microbiome research, survey studies, and digital behavioral analysis. Clustering such data remains challenging because existing methods are often computationally demanding, sensitive to sparsity and discreteness, or designed for specific data types. We propose a deterministic dimension-reduction framework for clustering high-dimensional discrete observations. The method compresses each observation into a low-dimensional continuous representation through weighted sums defined by a scaled positional encoding, yielding a numerically stable transformation applicable to binary, categorical, and count-valued data. We establish several theoretical properties of the proposed compression. First, the mapping is injective, so distinct observations remain distinct after compression. Second, under mild regularity conditions, the compressed variables admit an approximate Gaussian representation, which provides a theoretical basis for model-based clustering in the compressed space. Third, separation between cluster centroids is preserved under compression, implying that location-driven cluster structure remains identifiable after dimension reduction. Extensive simulation studies demonstrate accurate cluster recovery across a wide range of realistic settings. The approach is also substantially faster than dimension-reduction techniques commonly paired with clustering. Applications to Irish baby-name records and microbiome data further illustrate its practical utility. The proposed framework thus offers a scalable and broadly applicable approach to clustering high-dimensional discrete data.
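
The compression idea can be illustrated with a minimal sketch. The abstract does not state the exact encoding, so everything below is an assumption made for illustration: the variables are split into consecutive blocks, and each block is collapsed to one number using base-b positional weights b^(-1), b^(-2), ..., where b is the number of levels per variable. The function name `compress` and the parameters `n_levels` and `block_size` are hypothetical and are not taken from the paper.

```python
import numpy as np


def compress(X, n_levels, block_size):
    """Collapse each block of discrete variables into one weighted sum.

    Assumed encoding (not the paper's stated definition): entries of X lie in
    {0, ..., n_levels - 1}, and position j within a block gets weight
    n_levels ** -(j + 1), so each output value lies in [0, 1).
    """
    X = np.asarray(X, dtype=np.int64)
    n, p = X.shape
    n_blocks = -(-p // block_size)  # ceiling division

    # Zero-pad the columns so they divide evenly into blocks.
    padded = np.zeros((n, n_blocks * block_size), dtype=np.int64)
    padded[:, :p] = X
    blocks = padded.reshape(n, n_blocks, block_size)

    # Scaled positional weights: 1/b, 1/b^2, ..., 1/b^block_size.
    weights = float(n_levels) ** -np.arange(1, block_size + 1)

    # One weighted sum per block: result has shape (n, n_blocks).
    return blocks @ weights


# Usage: four ternary variables compressed to two continuous coordinates.
Z = compress([[1, 0, 2, 1]], n_levels=3, block_size=2)
```

Under this assumed encoding, each block value is a finite base-b expansion, so two observations that differ in any variable differ in at least one compressed coordinate, which is the injectivity property in its simplest form. In floating-point arithmetic this only holds while `block_size` is small enough for the smallest weight to be resolved, so the block size trades off dimension against numerical precision. The compressed matrix could then be passed to any standard model-based clustering routine.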
