Detecting Batch Heterogeneity via Likelihood Clustering

Batch effects are a major confounder in genomic diagnostics. In copy number variant (CNV) detection from next-generation sequencing (NGS), many algorithms compare read depth between test samples and a reference sample, assuming the two are process-matched. When this assumption is violated, for reasons ranging from reagent lot changes to multi-site processing, the reference becomes inappropriate, introducing false CNV calls or masking true pathogenic variants. Detecting such heterogeneity before downstream analysis is therefore critical for reliable clinical interpretation. Existing batch effect detection methods either cluster samples on raw features, risking conflation of biological signal with technical variation, or require known batch labels that are frequently unavailable. We introduce a method that addresses both limitations by clustering samples according to their Bayesian model evidence. The central insight is that evidence quantifies the compatibility between data and model assumptions: technical artifacts violate those assumptions and reduce evidence, whereas biological variation, including CNV status, is anticipated by the model and yields high evidence. This asymmetry provides a discriminative signal that separates batch effects from biology. We formalize heterogeneity detection as a likelihood ratio test for mixture structure in evidence space, using parametric bootstrap calibration to ensure conservative false positive rates. We validate our approach on synthetic data, where it demonstrates proper Type I error control; on three clinical targeted sequencing panels (liquid biopsy, BRCA, and thalassemia) that exhibit distinct batch effect mechanisms; and on mouse electrophysiology recordings, which demonstrate cross-modality generalization. Our method achieves higher clustering accuracy than standard correlation-based and dimensionality-reduction approaches while maintaining the conservativeness required for clinical use.
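To make the testing idea concrete, the following is a minimal sketch of a likelihood ratio test for mixture structure with parametric bootstrap calibration. It is an illustration under stated assumptions, not the implementation used in this work: it assumes that a scalar log evidence has already been computed for each sample, that the homogeneous null is a single Gaussian, and that the alternative is a two-component Gaussian mixture fitted by EM. The function names, the variance floor, and the number of bootstrap replicates are all hypothetical choices.

```python
import math
import random
import statistics


def _norm_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def fit_one_gaussian(xs):
    """Maximum-likelihood single Gaussian (assumed homogeneous null)."""
    mu = statistics.fmean(xs)
    sigma = max(statistics.pstdev(xs), 1e-6)
    loglik = sum(_norm_logpdf(x, mu, sigma) for x in xs)
    return loglik, mu, sigma


def fit_two_gaussians(xs, n_iter=50):
    """Log-likelihood of a two-component Gaussian mixture fitted by EM."""
    ordered = sorted(xs)
    half = len(ordered) // 2
    # Initialise the components from the lower and upper halves of the data.
    mu = [statistics.fmean(ordered[:half]), statistics.fmean(ordered[half:])]
    sd0 = max(statistics.pstdev(xs), 1e-6)
    sigma = [sd0, sd0]
    w = [0.5, 0.5]
    floor = 0.05 * sd0  # heuristic floor (assumption) against degenerate components
    loglik = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibilities and log-likelihood under current parameters.
        resp, loglik = [], 0.0
        for x in xs:
            lp = [math.log(w[k]) + _norm_logpdf(x, mu[k], sigma[k]) for k in range(2)]
            m = max(lp)
            p = [math.exp(v - m) for v in lp]
            total = sum(p)
            loglik += m + math.log(total)
            resp.append([v / total for v in p])
        # M-step: update weights, means, and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-9)
            w[k] = max(nk / len(xs), 1e-9)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), floor)
    return loglik


def lrt_statistic(xs):
    """2 * (log-lik of mixture - log-lik of single Gaussian), clipped at zero."""
    ll_null, _, _ = fit_one_gaussian(xs)
    return max(0.0, 2.0 * (fit_two_gaussians(xs) - ll_null))


def bootstrap_lrt_pvalue(log_evidence, n_boot=99, seed=0):
    """Calibrate the LRT by simulating from the fitted single-Gaussian null."""
    rng = random.Random(seed)
    _, mu, sigma = fit_one_gaussian(log_evidence)
    observed = lrt_statistic(log_evidence)
    exceed = 0
    for _ in range(n_boot):
        simulated = [rng.gauss(mu, sigma) for _ in log_evidence]
        if lrt_statistic(simulated) >= observed:
            exceed += 1
    # The add-one correction keeps the p-value strictly positive and conservative.
    return observed, (1 + exceed) / (n_boot + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical log-evidence values: two batches with different typical evidence.
    values = [rng.gauss(-100, 1) for _ in range(30)] + [rng.gauss(-110, 1) for _ in range(30)]
    stat, p = bootstrap_lrt_pvalue(values)
    print(f"LRT statistic = {stat:.1f}, bootstrap p-value = {p:.3f}")
```

The bootstrap is used here because the usual chi-squared approximation for the likelihood ratio statistic does not hold when testing the number of mixture components, so the null distribution is simulated instead. With `n_boot` replicates the smallest attainable p-value is `1 / (n_boot + 1)`, so the replicate count bounds the resolution of the test.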
