Clustering time-series data in healthcare is crucial for clinical phenotyping: it helps to understand patients' disease progression patterns and to design treatment guidelines tailored to homogeneous patient subgroups. While rich temporal dynamics enable the discovery of potential clusters beyond static correlations, two major challenges remain: i) discovering predictive patterns among the many potential temporal correlations in multivariate time-series data, and ii) associating individual temporal patterns with the target label distribution that best characterizes the underlying clinical progression. To address these challenges, we develop a novel temporal clustering method, T-Phenotype, that discovers phenotypes of predictive temporal patterns from labeled time-series data. We introduce an efficient representation learning approach in the frequency domain that encodes variable-length, irregularly sampled time-series into a unified representation space; this representation is then used to identify the various temporal patterns that potentially contribute to the target label, using a new notion of path-based similarity. Through experiments on synthetic and real-world datasets, we show that T-Phenotype achieves the best phenotype discovery performance among all evaluated baselines. We further demonstrate the utility of T-Phenotype by uncovering clinically meaningful patient subgroups characterized by unique temporal patterns.