Modeling of longitudinal cohort data typically involves complex temporal dependencies between multiple variables. Here, the transformer architecture, which has been highly successful in language and vision applications, allows us to account for the fact that the most recently observed time points in an individual's history may not always be the most important for predicting the immediate future. This is achieved by assigning attention weights to an individual's observations based on a transformation of their values. One reason why these ideas have not yet been fully leveraged for longitudinal cohort data is that large datasets are typically required. We therefore present a simplified transformer architecture that retains the core attention mechanism while reducing the number of parameters to be estimated, making it more suitable for small datasets with few time points. Guided by a statistical perspective on transformers, we use an autoregressive model as a starting point and incorporate attention as a kernel-based operation with temporal decay, where the aggregation of multiple transformer heads, i.e., different candidate weighting schemes, is expressed as accumulating evidence on different types of underlying characteristics of individuals. This also enables a permutation-based statistical testing procedure for identifying contextual patterns. In a simulation study, the approach recovers contextual dependencies even with a small number of individuals and time points. In an application to data from a resilience study, we identify temporal patterns in the dynamics of stress and mental health. This indicates that properly adapted transformers can not only achieve competitive predictive performance but also uncover complex context dependencies in small-data settings.
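The core ingredient described above, attention as a kernel-based operation with temporal decay, can be illustrated with a minimal sketch. The function below is hypothetical and not the authors' implementation: it scores each past time point of one individual by similarity to the most recent observation, multiplicatively down-weights older time points with an exponential decay kernel (the `lengthscale` parameter is an assumption for illustration), and returns a softmax-weighted summary of the history, analogous to a single simplified attention head.

```python
import numpy as np

def decay_attention(x, lengthscale=2.0):
    """Sketch of kernel-based attention with temporal decay.

    x: array of shape (T, d), the T observed time points of one
       individual with d variables each (most recent last).
    Returns a weighted summary of the history, shape (d,).
    """
    T = x.shape[0]
    # similarity of each past time point to the latest observation
    sim = x @ x[-1]                                  # shape (T,)
    # exponential temporal decay: older time points get smaller weight
    lag = np.arange(T - 1, -1, -1)                   # lag 0 = most recent
    logits = sim - lag / lengthscale
    # numerically stable softmax over time points
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ x                                     # attention-weighted summary
```

In a multi-head variant, several such summaries with different decay kernels or value transformations would be computed and aggregated, corresponding to the different candidate weighting schemes mentioned in the text.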