Modern data-driven and distributed learning frameworks deal with massive, diverse data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings, however, heterogeneous data may be generated in clusters with shared structure, as is the case in applications such as federated learning, where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structure in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating the $d$-dimensional parameters of a two-component mixture of linear regressions, where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from $m$ batches of dependent samples, each containing $n$ measurements. When the clustered structure in the mixture model is discarded, EM is known to require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)})$. In contrast, we show that, if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as $m$ grows as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.
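To make the setting concrete, the following is a minimal sketch of one EM iteration on clustered data. It is not the paper's implementation, and the abstract does not specify the model in this detail, so several assumptions are made: the two components are symmetric, $\pm\theta^*$; node $i$ draws a single latent sign $z_i \in \{-1, +1\}$ uniformly at random and generates $y_{ij} = z_i \langle x_{ij}, \theta^* \rangle + \varepsilon_{ij}$ for $j = 1, \dots, n$; the covariates are standard Gaussian; and the noise level $\sigma$ is known. The function names `simulate`, `em_step`, and `run_em` are hypothetical. Under these assumptions, the E-step pools all $n$ samples of a node into one posterior weight, $\tanh\big(\sigma^{-2} \sum_j y_{ij} \langle x_{ij}, \theta \rangle\big)$, and the M-step is a weighted least-squares solve.

```python
import numpy as np


def simulate(m, n, d, theta_star, sigma, rng):
    """Draw m batches of n samples; each batch shares one latent sign.

    Assumed model (not taken from the paper's text):
    y_ij = z_i * <x_ij, theta_star> + noise, with z_i uniform on {-1, +1}.
    """
    z = rng.choice([-1.0, 1.0], size=m)          # one latent sign per node
    X = rng.standard_normal((m, n, d))           # Gaussian covariates (assumption)
    noise = sigma * rng.standard_normal((m, n))
    y = z[:, None] * (X @ theta_star) + noise    # shape (m, n)
    return X, y, z


def em_step(X, y, theta, sigma):
    """One EM iteration for the assumed symmetric two-component model."""
    m, n, d = X.shape
    # E-step: the n samples of a node share a latent sign, so their
    # log-likelihood ratios add up before the tanh is applied.
    margins = np.einsum("ij,ij->i", y, X @ theta) / sigma**2
    w = np.tanh(margins)                         # E[z_i | batch i, theta], in [-1, 1]
    # M-step: least squares with responses re-signed by the soft labels.
    X_flat = X.reshape(m * n, d)
    gram = X_flat.T @ X_flat
    rhs = np.einsum("i,ij,ijk->k", w, y, X)
    return np.linalg.solve(gram, rhs)


def run_em(X, y, theta0, sigma, num_iters):
    """Iterate em_step from the initial point theta0."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        theta = em_step(X, y, theta, sigma)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_star = np.ones(5)
    X, y, _ = simulate(m=200, n=50, d=5, theta_star=theta_star, sigma=0.5, rng=rng)
    theta0 = theta_star + 0.3 * rng.standard_normal(5)   # a "proper" initialization
    theta_hat = run_em(X, y, theta0, sigma=0.5, num_iters=1)
    print("estimation error after one iteration:", np.linalg.norm(theta_hat - theta_star))
```

Setting $n = 1$ in this sketch recovers the unstructured case in which every sample has its own latent sign. For larger $n$, the pooled margin in the E-step grows with $n$, so the soft labels saturate near $\pm 1$ after very few iterations, which is one intuitive reading of the $O(1)$ iteration claim above.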