The latent block model is used to simultaneously cluster the rows and columns of a matrix in order to reveal a block structure. The algorithms used for estimation are often time-consuming. However, recent work shows that the log-likelihood ratios are equivalent under the complete model and the observed model (with unknown labels), and that the posterior distribution of the groups converges, as the size of the data increases, to a Dirac mass located at the true group configuration. Based on these observations, this paper proposes the $Largest$ $Gaps$ algorithm, which performs clustering using only the marginals of the matrix, for binary data in which the number of blocks is very small relative to the size of the whole matrix. In addition, a model selection method is incorporated, together with a proof of its consistency. The paper thus shows that applying complex algorithms to simplistic configurations (few blocks compared to the size of the matrix, or very contrasting blocks) is useless, since the marginals already give very good parameter and classification estimates.
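
To make the idea of clustering from the marginals alone concrete, here is a minimal Python sketch of one plausible reading of a largest-gaps rule. It is not the authors' exact procedure: it assumes the number of row and column groups is already known (the model selection step mentioned above is not reproduced), and the names `largest_gaps_labels`, `largest_gaps_cocluster`, and `simulate_binary_blocks` are hypothetical helpers introduced only for illustration.

```python
import random


def largest_gaps_labels(means, n_groups):
    """Split values into n_groups by cutting the sorted values at the largest gaps.

    Assumption: n_groups is known in advance. Group 0 holds the smallest values.
    """
    order = sorted(range(len(means)), key=lambda i: means[i])
    # gaps[k] is the distance between the k-th and (k+1)-th sorted values.
    gaps = [means[order[k + 1]] - means[order[k]] for k in range(len(order) - 1)]
    # The positions of the (n_groups - 1) largest gaps serve as group boundaries.
    by_size = sorted(range(len(gaps)), key=lambda k: gaps[k], reverse=True)
    cuts = set(by_size[: n_groups - 1])
    labels = [0] * len(means)
    group = 0
    for rank, i in enumerate(order):
        labels[i] = group
        if rank in cuts:  # a boundary follows this value
            group += 1
    return labels


def largest_gaps_cocluster(matrix, n_row_groups, n_col_groups):
    """Cluster rows and columns of a binary matrix using only its marginal means."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    return (largest_gaps_labels(row_means, n_row_groups),
            largest_gaps_labels(col_means, n_col_groups))


def simulate_binary_blocks(row_labels, col_labels, block_probs, seed=0):
    """Draw a binary matrix whose entry (i, j) is Bernoulli(block_probs[z_i][w_j])."""
    rng = random.Random(seed)
    return [[1 if rng.random() < block_probs[z][w] else 0 for w in col_labels]
            for z in row_labels]


if __name__ == "__main__":
    # Illustrative, well-separated configuration (values chosen for the demo).
    true_rows = [0] * 100 + [1] * 100
    true_cols = [0] * 100 + [1] * 100
    probs = [[0.1, 0.5], [0.5, 0.9]]
    data = simulate_binary_blocks(true_rows, true_cols, probs, seed=1)
    est_rows, est_cols = largest_gaps_cocluster(data, 2, 2)
    row_errors = sum(a != b for a, b in zip(est_rows, true_rows))
    print("misclassified rows:", row_errors)
```

The sketch only sorts the row and column means and cuts at the widest gaps, so its cost is dominated by computing the marginals and two sorts. It relies on the groups having clearly different marginal means, which matches the few-blocks, high-contrast setting described above.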