In genetic studies, haplotype data provide more refined information than data on separate genetic markers. However, large-scale studies that genotype hundreds to thousands of individuals may report only pooled data, in which only the total allele count of each marker in each pool is available. Existing methods for inferring haplotype frequencies from pooled genetic data that scale well with pool size rely on a normal approximation, which we observe to produce unreliable inference when applied to real data. We illustrate cases in which the approximation breaks down because the covariance matrix of the normal distribution is near-singular. As an alternative to approximate methods, in this paper we propose exact methods to infer haplotype frequencies from pooled genetic data based on a latent multinomial model, in which the observed allele counts are treated as integer combinations of latent, unobserved haplotype counts. One of our methods, latent count sampling via Markov bases, achieves approximately linear runtime with respect to pool size. Our exact methods produce more accurate inference than existing approximate methods, both for synthetic data and for data based on haplotype information from the 1000 Genomes Project. We also demonstrate how our methods can be applied to time series of pooled genetic data, as a proof of concept of their relevance to more complex hierarchical settings, such as spatiotemporal models.
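The latent multinomial structure described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the paper's implementation: the function names (`haplotype_matrix`, `simulate_pool`, `normal_approx_moments`), the three-marker setup, the pool size, and the haplotype frequencies are all hypothetical. Latent haplotype counts `z` are drawn from a multinomial, and the observed allele counts are the integer combination `y = A @ z`, where each column of `A` is one binary haplotype. The frequencies are chosen so that markers 2 and 3 always co-occur, which is one assumed way for the covariance matrix of a normal approximation to become singular.

```python
import itertools
import numpy as np

def haplotype_matrix(n_markers):
    """Binary matrix whose columns enumerate all 2**n_markers haplotypes."""
    haps = list(itertools.product([0, 1], repeat=n_markers))
    return np.array(haps, dtype=int).T  # shape: (n_markers, 2**n_markers)

def simulate_pool(A, freqs, pool_size, rng):
    """Draw latent haplotype counts and the allele counts they imply."""
    z = rng.multinomial(pool_size, freqs)  # latent, unobserved haplotype counts
    y = A @ z                              # observed allele count per marker
    return z, y

def normal_approx_moments(A, freqs, pool_size):
    """Mean and covariance of y under a multinomial model for z."""
    mean = pool_size * (A @ freqs)
    multinomial_cov = np.diag(freqs) - np.outer(freqs, freqs)
    cov = pool_size * (A @ multinomial_cov @ A.T)
    return mean, cov

rng = np.random.default_rng(0)
A = haplotype_matrix(3)

# Hypothetical frequencies: only haplotypes 000, 011, 100, 111 occur,
# so markers 2 and 3 always carry the same allele.
freqs = np.array([0.4, 0.0, 0.0, 0.1, 0.2, 0.0, 0.0, 0.3])
pool_size = 100  # number of haplotypes in the pool (assumed)

z, y = simulate_pool(A, freqs, pool_size, rng)
mean, cov = normal_approx_moments(A, freqs, pool_size)

print("latent haplotype counts z:", z)
print("observed allele counts y:", y)
print("rank of normal-approximation covariance:", np.linalg.matrix_rank(cov))
```

In this toy setup the second and third allele counts are always equal, so the 3 × 3 covariance matrix is rank-deficient and a multivariate normal density with that covariance is not well defined. An exact approach would instead work with the integer vectors `z` that satisfy `A @ z == y`.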