Probabilistic Learning and Generation in Deep Sequence Models

Despite the exceptional predictive performance of deep sequence models (DSMs), the main concern about their deployment is their lack of uncertainty awareness. In contrast, probabilistic models quantify the uncertainty associated with unobserved variables using the rules of probability. Notably, Bayesian methods leverage Bayes' rule to express our beliefs about unobserved variables in a principled way. Since exact Bayesian inference is computationally infeasible at scale, approximate inference is required in practice. Two major bottlenecks of Bayesian methods, especially when applied to deep neural networks, are prior specification and approximation quality. In Chapters 3 and 4, we investigate how the architectures of DSMs themselves can inform the design of priors or approximations in probabilistic models. We first develop an approximate Bayesian inference method tailored to the Transformer, based on the similarity between attention and sparse Gaussian processes. Next, we exploit the long-range memory preservation capability of HiPPOs (High-order Polynomial Projection Operators) to construct an interdomain inducing point for Gaussian processes, which successfully memorizes the history in online learning. Beyond the progress of DSMs in predictive tasks, sequential generative models built on a sequence of latent variables have become popular in deep generative modeling. Inspired by the explicit self-supervised signals for these latent variables in diffusion models, in Chapter 5 we explore the possibility of improving other generative models with self-supervision for their sequential latent states, and investigate desirable probabilistic structures over them. Overall, this thesis leverages inductive biases in DSMs to design probabilistic inference or structure, bridging the gap between DSMs and probabilistic models and leading to mutually reinforcing improvements.
