An inherent challenge in computing fully explicit generalization bounds for transformers is obtaining covering number estimates for the given transformer class $T$. Crude estimates rely on a uniform upper bound on the local Lipschitz constants of the transformers in $T$, while finer estimates require an analysis of their higher-order partial derivatives. Unfortunately, such precise higher-order derivative estimates for (realistic) transformer models are not currently available in the literature, as they are combinatorially delicate due to the intricate compositional structure of transformer blocks. This paper fills this gap by precisely estimating the partial derivatives of every order for the transformer model. We consider realistic transformers with multiple (non-linearized) attention heads per block and layer normalization. We obtain fully explicit estimates of all constants in terms of the number of attention heads, the depth and width of each transformer block, and the number of normalization layers. Furthermore, we explicitly analyze the impact of various standard choices of activation function (e.g. Swish and GELU). As an application, we obtain explicit pathwise generalization bounds for transformers trained on a single trajectory of an exponentially ergodic Markov process, valid at a fixed future time horizon. We conclude that real-world transformers can learn from $N$ (non-i.i.d.) samples of a single Markov process trajectory at a rate of $O(\operatorname{polylog}(N)/\sqrt{N})$.
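To make the covering-number step concrete, the following is a generic textbook sketch of how a uniform Lipschitz bound transfers a parameter cover to a function cover; the quantities $\Theta$, $d$, $R$, $L$, and $f_\theta$ are illustrative stand-ins, not the refined constants derived in the paper. For a parametric class $T=\{f_\theta:\theta\in\Theta\}$ with $\Theta\subseteq\mathbb{R}^d$ contained in a ball of radius $R$ and $\|f_\theta-f_{\theta'}\|_\infty\le L\,\|\theta-\theta'\|_2$, the standard volumetric bound gives
\[
\mathcal{N}\big(T,\|\cdot\|_\infty,\varepsilon\big)
\;\le\;
\mathcal{N}\big(\Theta,\|\cdot\|_2,\varepsilon/L\big)
\;\le\;
\Big(1+\frac{2LR}{\varepsilon}\Big)^{d}.
\]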
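For orientation only, plugging such a log-covering estimate into the usual single-scale discretization argument already exhibits the advertised rate in the bounded i.i.d. case (the notation $X_n$ and the choice $\varepsilon=1/\sqrt{N}$ below are illustrative); the paper's contribution is to carry this through for a single trajectory of an exponentially ergodic Markov process with fully explicit constants:
\[
\sup_{f\in T}\Big|\frac{1}{N}\sum_{n=1}^{N} f(X_n)-\mathbb{E}[f(X)]\Big|
\;\lesssim\;
\inf_{\varepsilon>0}\Big(\varepsilon+\sqrt{\tfrac{\log\mathcal{N}(T,\|\cdot\|_\infty,\varepsilon)}{N}}\,\Big)
\;\le\;
\frac{1}{\sqrt{N}}+\sqrt{\frac{d\log\!\big(1+2LR\sqrt{N}\big)}{N}}
\;=\;O\Big(\frac{\operatorname{polylog}(N)}{\sqrt{N}}\Big),
\]
where the first inequality holds with high probability (up to constants) for uniformly bounded $f$, via a union bound over an $\varepsilon$-cover; in the Markovian setting of the paper, the ergodicity constants enter through this concentration step.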