We study the following distribution clustering problem: $k$ distributions are split by a hidden partition into two groups, such that all distributions within a group are identical and the distributions of the two groups are $\varepsilon$-far in total variation distance; the goal is to recover the partition from samples. We establish upper and lower bounds on the sample complexity in two fundamental cases: (1) when the distribution of one of the clusters is known, and (2) when both are unknown. Our bounds characterize how the sample complexity depends on the domain size $n$, the number of distributions $k$, the size $r$ of one of the clusters, and the distance $\varepsilon$. In particular, they are tight with respect to $(n,k,r,\varepsilon)$, up to an $O(\log k)$ factor, in all regimes.
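One way to write this setup formally is sketched here. The symbols $p_1,\dots,p_k$, $P$, $Q$, and $S$ are notation assumed for illustration and do not appear in the abstract; taking the domain to be $[n]=\{1,\dots,n\}$ and letting $r$ denote $|S|$ are likewise assumptions.

$$
p_i =
\begin{cases}
P, & i \in S,\\
Q, & i \in [k]\setminus S,
\end{cases}
\qquad |S| = r,
\qquad d_{\mathrm{TV}}(P,Q) = \frac{1}{2}\sum_{x\in[n]} \lvert P(x)-Q(x)\rvert \;\ge\; \varepsilon .
$$

In this notation, the learner draws samples from $p_1,\dots,p_k$ and must output the partition $\{S,\,[k]\setminus S\}$. In case (1) one of $P$, $Q$ is given explicitly; in case (2) neither is.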