Biclustering is a technique that allows for the simultaneous clustering of observations and features in a dataset. This technique is often used in bioinformatics, text mining, and time series analysis. An important advantage of biclustering algorithms is their ability to uncover multiple ``views'' of the data (i.e., through row and column groupings). Several Gaussian mixture model-based biclustering approaches currently exist in the literature. However, they impose severe restrictions on the structure of the covariance matrix. Here, we propose a Gaussian mixture model-based biclustering approach that provides a more flexible block-diagonal covariance structure. We show that the clustering accuracy of the proposed model is comparable to that of other known techniques, while our approach allows a more flexible covariance structure and requires substantially less computational time. We demonstrate the application of the proposed model in bioinformatics and topic modelling.