Optimal Transport for Latent Integration with An Application to Heterogeneous Neuronal Activity Data

Detecting dynamic patterns of task-specific responses shared across heterogeneous datasets is an essential and challenging problem in many applications in medical science and neuroscience. In our motivating example of rodent electrophysiological data, identifying the dynamic patterns in neuronal activity associated with ongoing cognitive demands and behavior is key to uncovering the neural mechanisms of memory. One of the greatest challenges in investigating a cross-subject biological process is that systematic heterogeneity across individuals can significantly undermine the power of existing machine learning methods to identify the underlying biological dynamics. In addition, many technically challenging neurobiological experiments are conducted on only a handful of subjects, although rich longitudinal data are available for each subject. The small sample sizes of such experiments can further reduce the power to detect dynamic patterns common to all subjects. In this paper, we propose a novel heterogeneous data integration framework based on optimal transport to extract shared patterns in complex biological processes. A key advantage of the proposed method is that it increases discriminating power in identifying common patterns: by aligning the extracted latent spatiotemporal information across subjects, it reduces heterogeneity unrelated to the signal. Our approach is effective even with a small number of subjects, and it does not require auxiliary matching information for the alignment. In particular, our method can align longitudinal data across heterogeneous subjects in a common latent space to capture the dynamics of shared patterns while utilizing temporal dependency within subjects.
