We consider the problem of mixed linear regression (MLR), where each observed sample belongs to one of $K$ unknown linear models. In practical applications, the proportions of the $K$ components are often imbalanced. Unfortunately, most MLR methods do not perform well in such settings. Motivated by this practical challenge, we propose Mix-IRLS, a novel, simple and fast algorithm for MLR with excellent performance on both balanced and imbalanced mixtures. In contrast to popular approaches that recover the $K$ models simultaneously, Mix-IRLS recovers them sequentially using tools from robust regression. Empirically, Mix-IRLS succeeds in a broad range of settings where other methods fail. These include imbalanced mixtures, small sample sizes, the presence of outliers, and an unknown number of models $K$. In addition, Mix-IRLS outperforms competing methods on several real-world datasets, in some cases by a large margin. We complement our empirical results by deriving a recovery guarantee for Mix-IRLS, which highlights its advantage on imbalanced mixtures.
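To make the sequential idea concrete, the following is a minimal Python sketch of recovering the components of an imbalanced mixture one at a time with a generic robust regression step. It is not the Mix-IRLS algorithm itself, whose details are not given in this abstract. The function names `irls_fit` and `sequential_mlr`, the Cauchy-type weight function, the median-based residual scale, the inlier threshold `inlier_mult`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def irls_fit(X, y, n_iter=50, tune=1.0, eps=1e-12):
    """Robust linear fit via IRLS with Cauchy-type weights (illustrative choice)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least-squares start
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r)) + eps       # robust residual scale
        w = 1.0 / (1.0 + (r / (tune * scale)) ** 2)
        sw = np.sqrt(w)
        # Weighted least squares, solved as OLS on rows rescaled by sqrt(w).
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta


def sequential_mlr(X, y, K, inlier_mult=5.0):
    """Recover up to K linear models one at a time (hypothetical helper).

    Each round treats the samples of the not-yet-recovered, non-dominant
    components as outliers, fits the dominant component robustly, and then
    removes the samples that this fit explains well.
    """
    active = np.arange(len(y))
    betas = []
    for _ in range(K):
        if active.size < X.shape[1]:
            break  # too few samples left to fit another model
        beta = irls_fit(X[active], y[active])
        betas.append(beta)
        r = np.abs(y[active] - X[active] @ beta)
        scale = np.median(r) + 1e-12
        active = active[r > inlier_mult * scale]  # keep only unexplained samples
    return betas


# Synthetic imbalanced mixture (assumed setup): 70% / 30% split, small noise.
rng = np.random.default_rng(0)
n, d, n_major = 600, 3, 420
beta_major = np.array([2.0, -1.0, 0.5])
beta_minor = np.array([-1.5, 1.0, 2.0])
X = rng.standard_normal((n, d))
is_major = np.arange(n) < n_major
y = np.where(is_major, X @ beta_major, X @ beta_minor)
y = y + 0.01 * rng.standard_normal(n)

betas = sequential_mlr(X, y, K=2)
for k, b in enumerate(betas):
    print(f"component {k}: {np.round(b, 3)}")
```

In this sketch the imbalance helps rather than hurts: the dominant component contains the majority of the samples, so the robust fit can treat the remaining samples as outliers. Once its samples are removed, the next component becomes dominant among the samples that remain. The sketch relies on one component holding a clear majority of the active samples in each round, and a near-balanced mixture would need a more careful scheme.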