Applications of Machine Learning in Pharmacogenomics: Clustering Plasma Concentration-Time Curves

Pharmaceutical researchers are continually searching for techniques to improve both drug development processes and patient outcomes. An area of recent interest is the potential for machine learning (ML) applications within pharmacology. One such application that has not yet been studied closely is the unsupervised clustering of plasma concentration-time curves, hereafter pharmacokinetic (PK) curves. In this paper, we present our findings on how to cluster PK curves by their similarity. Specifically, we find clustering to be effective at identifying similarly shaped PK curves and informative for understanding patterns within each cluster of PK curves. Because PK curves are time series data objects, our approach uses the extensive body of research on clustering time series data as a starting point. Accordingly, we examine many dissimilarity measures between time series data objects to find those most suitable for PK curves. We identify Euclidean distance as generally most appropriate for clustering PK curves, and we further show that dynamic time warping, Fréchet distance, and structure-based measures of dissimilarity such as correlation may produce unexpected results. As an illustration, we apply these methods in a case study with 250 PK curves used in a previous pharmacogenomic study. Our case study finds that unsupervised ML clustering with Euclidean distance, without any subject genetic information, independently reaches the same conclusions as the reference pharmacogenomic results. To our knowledge, this is the first such demonstration. Further, the case study demonstrates how the clustering of PK curves may generate insights that could be difficult to perceive solely with population-level summary statistics of PK metrics.
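As a rough illustration of why the choice of dissimilarity measure matters, the sketch below compares Euclidean distance with a correlation-based dissimilarity on synthetic curves. Everything in it is an assumption made for illustration: the one-compartment oral-absorption model, the parameter values, the sampling grid, and the function names are hypothetical and are not taken from the study. It shows one plausible reason a structure-based measure could behave unexpectedly: correlation ignores magnitude, so two curves with the same shape but a threefold difference in exposure look identical to it, while Euclidean distance separates them.

```python
import numpy as np


def pk_curve(t, dose, ka, ke):
    """Synthetic one-compartment oral-absorption curve (unit volume assumed).

    dose, ka (absorption rate) and ke (elimination rate) are hypothetical
    illustration values, not parameters from the study.
    """
    return dose * ka / (ka - ke) * (np.exp(-ke * t) - np.exp(-ka * t))


def euclidean(x, y):
    """Euclidean distance between two curves sampled at the same time points."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


def correlation_dissimilarity(x, y):
    """One minus the Pearson correlation between two sampled curves."""
    return float(1.0 - np.corrcoef(x, y)[0, 1])


# Common sampling grid: every 30 minutes over 24 hours (assumed).
t = np.linspace(0.0, 24.0, 49)

low = pk_curve(t, dose=100.0, ka=1.0, ke=0.2)
high = pk_curve(t, dose=300.0, ka=1.0, ke=0.2)   # same shape, 3x the exposure
slow = pk_curve(t, dose=100.0, ka=1.0, ke=0.05)  # same dose, slower elimination

for name, other in [("high", high), ("slow", slow)]:
    print(
        f"low vs {name}: "
        f"euclidean={euclidean(low, other):.2f}, "
        f"1 - corr={correlation_dissimilarity(low, other):.4f}"
    )
```

In this toy setup the correlation-based dissimilarity between `low` and `high` is zero up to rounding, even though the two curves differ by a factor of three at every time point. Whether that behavior is desirable depends on whether magnitude or shape is the quantity of interest.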
