The increasing availability of large clinical datasets collected from patients enables new avenues for the computational characterization of complex diseases using different analytic algorithms. One promising new method for extracting knowledge from large clinical datasets is temporal pattern mining integrated with machine learning workflows. However, mining these temporal patterns is a computationally intensive task with a substantial memory footprint. Current algorithms, such as the temporal sequence pattern mining (tSPM) algorithm, already provide promising outcomes but still leave room for optimization. In this paper, we present the tSPM+ algorithm, a high-performance implementation of the tSPM algorithm that adds a new dimension by attaching the duration to the temporal patterns. We show that the tSPM+ algorithm provides a speedup of up to a factor of 980 and an up to 48-fold improvement in memory consumption. Moreover, we present a Docker container with an R package. We also provide vignettes for easy integration into existing machine learning workflows, and we use the mined temporal sequences to identify Post COVID-19 patients and their symptoms according to the WHO definition.
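To make the idea of temporal patterns with an added duration concrete, the following is a minimal Python sketch, not the tSPM+ implementation itself. It assumes that a temporal sequence is an ordered pair of clinical codes recorded for the same patient, annotated with the number of days between the two records. The function name `mine_sequences`, the record layout `(patient_id, code, date)`, and the example codes are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


def mine_sequences(records):
    """Build pairwise temporal sequences with durations (illustrative only).

    records: iterable of (patient_id, code, date) tuples (assumed layout).
    Returns a list of (patient_id, first_code, second_code, duration_days).
    """
    # Group each patient's records so that pairs are formed per patient only.
    by_patient = defaultdict(list)
    for patient_id, code, when in records:
        by_patient[patient_id].append((when, code))

    sequences = []
    for patient_id, events in by_patient.items():
        # Sort chronologically; ties on the date are broken by the code.
        events.sort()
        # combinations() on the sorted list yields every pair (earlier, later).
        for (t1, c1), (t2, c2) in combinations(events, 2):
            sequences.append((patient_id, c1, c2, (t2 - t1).days))
    return sequences


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    sample = [
        ("p1", "covid", date(2021, 1, 1)),
        ("p1", "fatigue", date(2021, 4, 1)),
        ("p2", "covid", date(2021, 2, 1)),
        ("p2", "fatigue", date(2021, 2, 10)),
    ]
    for row in mine_sequences(sample):
        print(row)
```

The number of pairs grows quadratically with the number of records per patient, which illustrates why runtime and memory consumption matter for this kind of mining.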