Replicability is the cornerstone of scientific research. We study the replicability of data from high-throughput experiments, in which tens of thousands of features are examined simultaneously. Existing replicability analysis methods either ignore the dependence among features or impose strong modelling assumptions, and so produce overly conservative or overly liberal results. Working from the $p$-values of two studies, we use a four-state hidden Markov model to capture the local dependence structure. Our method borrows information across features and across studies while accounting for dependence among features and heterogeneity across studies. We show, both theoretically and empirically, that the proposed method has higher power than competing methods while controlling the false discovery rate. Applying it to datasets from genome-wide association studies reveals new biological insights that existing methods cannot provide.
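To make the four-state idea concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the authors' implementation. It assumes the four hidden states are the pairs (θ1, θ2) ∈ {(0,0), (0,1), (1,0), (1,1)} indicating whether a feature is non-null in each study, that the two $p$-values are conditionally independent given the state, that null $p$-values are Uniform(0,1), and that non-null $p$-values follow a Beta(a, 1) density. The Beta form, the fixed parameters `a1` and `a2`, the known transition matrix `trans`, and the function names are all hypothetical choices made for illustration; a real analysis would estimate these quantities from the data. Rejection uses a standard step-up rule on the posterior probability of the replicability null, which is likewise an assumption about how the false discovery rate might be controlled.

```python
import numpy as np

# Hidden states as (theta1, theta2): (0,0), (0,1), (1,0), (1,1).
# Only (1,1) counts as a replicable signal (assumed encoding).
STATES = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])


def emission_density(p1, p2, a1=0.3, a2=0.3):
    """Joint density of (p1, p2) under each state, shape (m, 4).

    Assumption: null p-values are Uniform(0,1) with density 1, and
    non-null p-values are Beta(a, 1) with density a * p**(a - 1).
    """
    p1 = np.clip(np.asarray(p1, dtype=float), 1e-300, 1.0)
    p2 = np.clip(np.asarray(p2, dtype=float), 1e-300, 1.0)
    f1 = np.stack([np.ones_like(p1), a1 * p1 ** (a1 - 1)], axis=1)
    f2 = np.stack([np.ones_like(p2), a2 * p2 ** (a2 - 1)], axis=1)
    # Conditional independence of the two studies given the state.
    return f1[:, STATES[:, 0]] * f2[:, STATES[:, 1]]


def posterior_states(p1, p2, trans, init, a1=0.3, a2=0.3):
    """Scaled forward-backward pass; returns P(state_j | all data), shape (m, 4)."""
    dens = emission_density(p1, p2, a1, a2)
    m = dens.shape[0]
    alpha = np.zeros((m, 4))
    beta = np.ones((m, 4))
    alpha[0] = init * dens[0]
    alpha[0] /= alpha[0].sum()
    for j in range(1, m):
        alpha[j] = (alpha[j - 1] @ trans) * dens[j]
        alpha[j] /= alpha[j].sum()  # rescale to avoid underflow
    for j in range(m - 2, -1, -1):
        beta[j] = trans @ (dens[j + 1] * beta[j + 1])
        beta[j] /= beta[j].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)


def reject_replicable(null_prob, level=0.05):
    """Step-up rule: reject the k smallest null probabilities, with k the
    largest value whose running mean stays at or below `level`."""
    null_prob = np.asarray(null_prob, dtype=float)
    order = np.argsort(null_prob)
    running_mean = np.cumsum(null_prob[order]) / np.arange(1, len(order) + 1)
    passing = np.nonzero(running_mean <= level)[0]
    reject = np.zeros(len(null_prob), dtype=bool)
    if passing.size:
        reject[order[: passing[-1] + 1]] = True
    return reject


# Toy input: features 2-4 show strong signals in both studies.
p1 = np.array([0.5, 0.6, 1e-8, 1e-9, 1e-8, 0.7])
p2 = np.array([0.4, 0.9, 1e-7, 1e-8, 1e-9, 0.3])
trans = np.full((4, 4), 0.1) + 0.6 * np.eye(4)  # "sticky" chain, rows sum to 1
init = np.full(4, 0.25)

post = posterior_states(p1, p2, trans, init)
null_prob = 1.0 - post[:, 3]  # posterior probability of not being in state (1,1)
reject = reject_replicable(null_prob, level=0.05)
```

The sticky transition matrix is what lets neighbouring features share evidence: a feature next to a run of strong joint signals receives a higher posterior probability for state (1,1) than the same pair of $p$-values would receive in isolation.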