Identifying complex phenotypes from high-dimensional biological data is challenging because physiological indicators are intricately interdependent. Traditional approaches often focus on detecting outliers in single variables, overlooking the broader network of interactions that contribute to phenotype emergence. Here, we introduce ODBAE (Outlier Detection using Balanced Autoencoders), a machine learning method designed to uncover both subtle and extreme outliers by capturing latent relationships among multiple physiological parameters. ODBAE's revised loss function enhances its ability to detect two key types of outliers: influential points (IP), which disrupt latent correlations between dimensions, and high leverage points (HLP), which deviate from the norm but go undetected by traditional autoencoder-based methods. Using data from the International Mouse Phenotyping Consortium (IMPC), we show that ODBAE can identify knockout mice with complex, multi-indicator phenotypes: mice that appear normal in each individual trait but abnormal when the traits are considered together. In addition, the method reveals novel metabolism-related genes and uncovers coordinated abnormalities across metabolic indicators. Our results highlight the utility of ODBAE in detecting joint abnormalities and advancing our understanding of homeostatic perturbations in biological systems.
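The distinction between the two outlier types can be illustrated with a small synthetic sketch. It is not ODBAE and does not reproduce its loss function: a one-component PCA is used here as an assumed linear stand-in for an autoencoder with a one-dimensional bottleneck, and the data, the 99% threshold, and all names (`recon_error`, `latent_score`, `max_abs_z`, `ip`, `hlp`) are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two strongly correlated "indicators" (illustrative, not IMPC data).
n = 500
x = rng.normal(0.0, 1.0, n)
y = 0.9 * x + rng.normal(0.0, 0.2, n)
data = np.column_stack([x, y])

mean = data.mean(axis=0)
std = data.std(axis=0)
centered = data - mean

# One-component PCA as a linear stand-in for an autoencoder with a 1-D bottleneck.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]  # unit vector along the dominant correlation direction


def recon_error(point):
    """Squared reconstruction error after projecting onto pc1."""
    c = np.asarray(point, dtype=float) - mean
    return float(np.sum((c - (c @ pc1) * pc1) ** 2))


def latent_score(point):
    """Coordinate of the point along pc1 (the 1-D latent code)."""
    return float((np.asarray(point, dtype=float) - mean) @ pc1)


def max_abs_z(point):
    """Largest per-variable |z-score|, i.e. a single-variable outlier check."""
    return float(np.max(np.abs((np.asarray(point, dtype=float) - mean) / std)))


# Reference values estimated from the synthetic training data.
train_errors = np.sum((centered - np.outer(centered @ pc1, pc1)) ** 2, axis=1)
threshold = float(np.quantile(train_errors, 0.99))  # assumed cutoff
latent_std = float((centered @ pc1).std())

# Influential-point-like case: each value is unremarkable, but the pair
# breaks the correlation between the two indicators.
ip = np.array([1.2, -1.2])

# High-leverage-like case: far from the bulk of the data, but exactly along
# the correlation direction, so it reconstructs almost perfectly.
hlp = mean + 8.0 * pc1

for name, p in [("IP", ip), ("HLP", hlp)]:
    print(
        name,
        "max |z| =", round(max_abs_z(p), 2),
        "| recon error above threshold:", recon_error(p) > threshold,
        "| latent z =", round(abs(latent_score(p)) / latent_std, 2),
    )
```

In this toy setting, the IP-like point passes a per-variable check yet has a large reconstruction error, while the HLP-like point has a near-zero reconstruction error despite lying far from the rest of the data, which is why a detector based on reconstruction error alone would miss it.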