Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

The Probabilistic Transformer (PT) establishes that the Transformer's self-attention, together with its feed-forward block, is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: the graph topology, the factor potentials, and the message-passing schedule are all explicit, inspectable primitives that can be engineered. PT was originally developed for natural language; in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer (ST-PT), which repairs PT's missing channel axis and weak per-step semantics, and adopt ST-PT as the shared backbone for all studies.

We then identify three distinct properties that PT/ST-PT offers as a factor-graph model and derive three Research Questions (RQs), one per property, that probe how each property can be exploited in time series:

RQ1. The graph topology and potentials are directly programmable primitives. Can they be used to inject symbolic time-series priors into ST-PT through structural graph modifications, especially under data scarcity and noise?

RQ2. The CRF's factor matrices are the operator's potentials. Can an external condition program these factor matrices on a per-sample basis, so that conditional generation becomes a structural operation rather than feature-level modulation of a fixed model?

RQ3. Each MFVI iteration is a Bayesian posterior update on the factor graph. Can this turn the latent transition in latent-space AutoRegressive (AR) forecasting from an opaque MLP into a principled posterior update, and can a CRF teacher distill its latents into the AR student to counter cumulative error?

We give one empirical study per question. Together, these three studies position ST-PT as a programmable framework for time-series modeling.
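To make the MFVI-on-CRF view more concrete, the following is a minimal sketch of parallel mean-field updates on a toy CRF. It is not the ST-PT implementation: the factor layout (one label variable and one head-selection variable per node, tied together by a single ternary log-potential matrix `T`), the function names, the `allowed` edge mask, and all shapes are assumptions made for illustration. The head update takes the form of a softmax over pairwise scores, which is the resemblance to attention that the equivalence builds on, and the `allowed` mask stands in for the idea that graph topology is an explicit, editable primitive.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive probability 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_field(unary, T, allowed=None, n_iters=5):
    """Parallel mean-field updates on a toy CRF (illustrative assumption).

    unary   : (n, K) log-potentials of the label variables Z_i.
    T       : (K, K) log-potential T[a, b] of the factor that is active
              when H_i = j, Z_i = a, Z_j = b.
    allowed : (n, n) boolean mask; allowed[i, j] says node j may be the
              head of node i. Each row needs at least one True entry.
              Defaults to every node except i itself. Requires n >= 2.
    Returns (q_z, q_h): label marginals (n, K), head marginals (n, n).
    """
    n, K = unary.shape
    if allowed is None:
        allowed = ~np.eye(n, dtype=bool)
    q_z = softmax(unary, axis=-1)
    for _ in range(n_iters):
        # Head update: expected factor score E_q[T[Z_i, Z_j]] for each pair,
        # normalised over candidate heads j (an attention-like softmax).
        scores = q_z @ T @ q_z.T
        scores = np.where(allowed, scores, -np.inf)
        q_h = softmax(scores, axis=-1)
        # Label update: messages received as a child (i selects j) and
        # as a head (j selects i), added to the unary log-potentials.
        msg_child = q_h @ (q_z @ T.T)
        msg_head = q_h.T @ (q_z @ T)
        q_z = softmax(unary + msg_child + msg_head, axis=-1)
    return q_z, q_h

# Usage: 6 time steps, 4 label states, random potentials.
rng = np.random.default_rng(0)
unary = rng.normal(size=(6, 4))
T = rng.normal(size=(4, 4))
q_z, q_h = mean_field(unary, T)

# Editing the topology: each step may only select itself or an earlier step.
causal = np.tril(np.ones((6, 6), dtype=bool))
q_z_causal, q_h_causal = mean_field(unary, T, allowed=causal)
```

The updates here are applied to all nodes at once rather than one coordinate at a time, which is a common simplification and is not guaranteed to increase the variational objective monotonically. Swapping `allowed` changes which dependencies the model can represent without touching any learned parameter, which is the kind of structural intervention RQ1 asks about.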
