A simple and flexible test of sample exchangeability with applications to statistical genomics

In scientific studies involving analyses of multivariate data, two basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or can they be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions: a test of sample exchangeability given the dependency structure of the features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, fast, and flexible. It controls the Type I error across realistic scenarios and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample or to find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test. For optimal LD splitting, the V test reports different optimal splits than previous approaches that do not rely on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).
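To make the first question concrete, the following is a minimal sketch of a permutation test of sample exchangeability when all features are assumed mutually independent. The abstract does not specify the test statistic or the resampling scheme, so both are assumptions here: the statistic is taken to be the variance of pairwise Hamming distances between samples, and the null distribution is approximated by permuting each feature column independently across samples. The function names are hypothetical and are not the flintyR or flintyPy API.

```python
import numpy as np


def pairwise_hamming_variance(X):
    """Variance of the Hamming distances over all unordered pairs of rows of X."""
    n = X.shape[0]
    # diffs[i, j] = number of features at which rows i and j differ
    diffs = (X[:, None, :] != X[None, :, :]).sum(axis=2)
    upper = np.triu_indices(n, k=1)  # each unordered pair counted once
    return diffs[upper].var()


def permutation_exchangeability_test(X, n_perm=200, seed=0):
    """Illustrative permutation test (assumed setup, not the authors' code).

    Assumes the columns of X are mutually independent features. Under
    sample exchangeability, shuffling each column independently across
    rows leaves the joint distribution of X unchanged, so the shuffled
    statistics approximate the null distribution.
    """
    rng = np.random.default_rng(seed)
    observed = pairwise_hamming_variance(X)
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        # Shuffle every feature column independently of the others
        X_perm = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        null_stats[b] = pairwise_hamming_variance(X_perm)
    # One-sided p-value: population structure inflates the variance
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_perm)
    return observed, p_value
```

On a binary matrix built from two groups with very different feature frequencies, the observed variance should far exceed the permuted values, giving a small p-value. For large feature counts, the abstract indicates that large-sample asymptotics are used instead of resampling; that approximation is not shown in this sketch.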
