We study how to partition a dataset into subgroups that maximize diversity within each subgroup while minimizing dissimilarity across subgroups. We introduce a novel partitioning method, the $\textit{Wasserstein Homogeneity Partition}$ (WHOMP), which minimizes the type I and type II errors that often arise in comparative and controlled trials from imbalanced group splitting, commonly referred to as accidental bias. We compare WHOMP analytically with existing partitioning methods, including random subsampling, covariate-adaptive randomization, rerandomization, and anti-clustering, and demonstrate its advantages. Moreover, we characterize the optimal solutions to the WHOMP problem and reveal an inherent trade-off among them between the stability of subgroup means and the stability of subgroup variances. Based on these theoretical insights, we design algorithms that not only obtain these optimal solutions but also let practitioners select the desired trade-off. Finally, we validate the effectiveness of WHOMP through numerical experiments, which show that it outperforms traditional methods.