In online clustering problems, there is often substantial uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods provide a natural way to represent and update this uncertainty over time, but their memory requirements are prohibitive for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method accurately and efficiently solves clustering problems in this setting and in others where traditional SMC struggles.