In this paper, we introduce several covering number bounds for linear function classes, each subject to different constraints on the input and matrix norms. These bounds depend on the rank of each class of matrices. We then apply them to derive generalization error bounds for single-layer transformers. Our results improve upon several existing generalization bounds in the literature and are independent of the input sequence length, highlighting the advantages of employing low-rank matrices in transformer design. More specifically, our generalization error bound decays as $O(1/\sqrt{n})$, where $n$ is the sample length, improving on existing results of order $O((\log n)/\sqrt{n})$. It also scales as $O(\log r_w)$, where $r_w$ is the rank of the combination of the query and key matrices.