While the great capability of Transformers significantly boosts prediction accuracy, it can also yield overconfident predictions that call for calibrated uncertainty estimation, a task commonly tackled with Gaussian processes (GPs). Existing works apply GPs with symmetric kernels under variational inference to the attention kernel, overlooking the fact that attention kernels are in essence asymmetric. Moreover, the complexity of deriving the GP posteriors remains high for large-scale data. In this work, we propose Kernel-Eigen Pair Sparse Variational Gaussian Processes (KEP-SVGP) for building uncertainty-aware self-attention, where the asymmetry of attention kernels is tackled by Kernel SVD (KSVD) and a reduced complexity is achieved. Through KEP-SVGP, i) the SVGP pair induced by the two sets of singular vectors from KSVD w.r.t. the attention kernel fully characterizes the asymmetry; ii) using only a small set of adjoint eigenfunctions from KSVD, the derivation of SVGP posteriors can be based on the inversion of a diagonal matrix containing singular values, which reduces time complexity; iii) an evidence lower bound is derived so that the variational parameters can be optimized towards this objective. Experiments verify the excellent performance and efficiency of our method on in-distribution, distribution-shift and out-of-distribution benchmarks.
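Two facts the abstract relies on can be checked numerically: a softmax attention matrix is generally asymmetric, and a diagonal matrix of singular values is inverted entry by entry. The following is a minimal sketch under stated assumptions, not the KEP-SVGP method itself: an ordinary matrix SVD from NumPy stands in for KSVD, the data and projection weights are random, and all names (`attention_kernel`, `truncated_svd`, `w_q`, `w_k`, `rank`) are hypothetical.

```python
import numpy as np


def attention_kernel(x, w_q, w_k):
    """Row-wise softmax attention matrix built from queries and keys of x."""
    q = x @ w_q
    k = x @ w_k
    scores = q @ k.T / np.sqrt(q.shape[1])
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def truncated_svd(kernel, rank):
    """Keep the top-`rank` left/right singular vectors and singular values."""
    u, s, vt = np.linalg.svd(kernel, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank].T


# Assumed toy sizes: 16 tokens, 8 features, 4 retained singular triplets.
rng = np.random.default_rng(0)
n, d, rank = 16, 8, 4
x = rng.normal(size=(n, d))
w_q = rng.normal(size=(d, d))  # query projection (random, illustrative)
w_k = rng.normal(size=(d, d))  # key projection (random, illustrative)

kernel = attention_kernel(x, w_q, w_k)

# Different query and key projections make the kernel asymmetric.
is_symmetric = np.allclose(kernel, kernel.T)

# Two sets of singular vectors: one for the rows, one for the columns.
left, sigma, right = truncated_svd(kernel, rank)

# Inverting diag(sigma) is an elementwise reciprocal: O(rank), not O(rank^3).
sigma_inv = 1.0 / sigma

print("symmetric:", is_symmetric)
print("shapes:", left.shape, sigma.shape, right.shape)
```

The left and right singular vectors differ precisely because the matrix is asymmetric, which is the property that a single symmetric-kernel GP cannot represent. The elementwise reciprocal shows why a posterior that only needs the inverse of a diagonal singular-value matrix is cheap to compute.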