In electronic health record (EHR) analysis, clustering patients according to patterns in their data is crucial for uncovering new disease subtypes. The medical literature often relies on classical hypothesis tests to assess differences in means between the resulting clusters. Because the clusters are themselves estimated from the data, applying these classical tests to post-clustering data introduces selection bias and often inflates the type I error rate. In this paper, we introduce a new statistical approach that adjusts for this bias when analyzing longitudinal data, that is, data collected over time. Our method extends classical selective inference methods for cross-sectional data to the longitudinal setting. We provide theoretical guarantees in the form of upper bounds on the selective type I and type II errors. We apply the method to simulated data and to real-world Acute Kidney Injury (AKI) EHR datasets to illustrate its advantages.
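The type I error inflation described above can be seen in a small simulation. The sketch below is only an illustration under simplifying assumptions and is not the method proposed in the paper: it uses cross-sectional (not longitudinal) one-dimensional Gaussian data with no true clusters, a hand-written 2-means routine, and a Welch-type test with a normal approximation to the critical value. All function names (`two_means_1d`, `welch_rejects`, `rejection_rates`) are hypothetical.

```python
import math
import random
import statistics


def two_means_1d(values, max_iter=100):
    """Lloyd's algorithm with k=2 on 1-D data; returns the two clusters."""
    c0, c1 = min(values), max(values)  # simple deterministic initialisation
    for _ in range(max_iter):
        # Assign each point to its nearest centre.
        left = [v for v in values if abs(v - c0) <= abs(v - c1)]
        right = [v for v in values if abs(v - c0) > abs(v - c1)]
        new_c0, new_c1 = statistics.fmean(left), statistics.fmean(right)
        if (new_c0, new_c1) == (c0, c1):  # centres stopped moving
            break
        c0, c1 = new_c0, new_c1
    return left, right


def welch_rejects(a, b, z_crit=1.96):
    """Two-sided Welch-type test at the 5% level (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return abs(z) > z_crit


def rejection_rates(n_reps=200, n=100, seed=0):
    """Rejection rates under a global null: no real group difference exists."""
    rng = random.Random(seed)
    naive = 0  # groups chosen by clustering the same data that is tested
    fixed = 0  # groups fixed before looking at the data
    for _ in range(n_reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        left, right = two_means_1d(x)
        if min(len(left), len(right)) >= 2:  # variance needs two points
            naive += welch_rejects(left, right)
        fixed += welch_rejects(x[: n // 2], x[n // 2:])
    return naive / n_reps, fixed / n_reps


if __name__ == "__main__":
    naive_rate, fixed_rate = rejection_rates()
    print(f"naive post-clustering rejection rate: {naive_rate:.2f}")
    print(f"pre-specified split rejection rate:   {fixed_rate:.2f}")
```

Since every sample is drawn from a single Gaussian, any rejection is a false positive. The naive post-clustering test should reject in nearly every replication, whereas the test on a pre-specified split should stay close to the nominal 5% level. Selective inference methods aim to close this gap by conditioning on the clustering event.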