Despite the demonstrated empirical efficacy of prompt tuning for adapting a pretrained language model to a new task, the theoretical understanding of how "tuning parameters before the input" differs from "tuning model weights" remains limited. We thus take one of the first steps toward understanding the role of soft-prompt tuning in transformer-based architectures. Considering a general-purpose architecture, we analyze prompt tuning through two lenses, for continuous-valued functions: universal approximation, and the limitations of finite-depth, fixed-weight pretrained transformers. Our universality result guarantees the existence of a strong transformer that, with a prompt, can approximate any sequence-to-sequence function in the set of Lipschitz functions. We first prove the limitations of prompt tuning for limited-depth transformers by constructing a set of datasets that cannot be memorized by a prompt of any length for a given single encoder layer. We also provide a lower bound on the required number of tunable prompt parameters and compare it with the number of parameters required for a low-rank update (based on LoRA) in the single-layer setting. Finally, we extend our analysis to multi-layer settings by providing sufficient conditions under which the transformer can at best learn datasets from invertible functions only. Our theoretical claims are corroborated by empirical results.
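To make the contrast between "tuning parameters before the input" and "tuning model weights" concrete, the following is a minimal sketch, not the construction used in the paper. It assumes a single-head self-attention layer with a residual connection and no MLP block, which is a simplification of an encoder layer. All function names (`attention_layer`, `prompt_tuned_forward`, `prompt_param_count`, `lora_param_count`) are hypothetical and introduced here only for illustration. The parameter counts are plain bookkeeping for a prompt of a given length and for one rank-r update of a single square weight matrix; they are not the lower bound derived in the paper.

```python
import math
import random


def random_matrix(rng, rows, cols):
    """Illustrative helper: a rows x cols matrix of Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def attention_layer(tokens, w_q, w_k, w_v):
    """Single-head self-attention plus residual; the weights stay fixed."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for tok, q in zip(tokens, queries):
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(tok))]
        outputs.append([x + m for x, m in zip(tok, mixed)])
    return outputs


def prompt_tuned_forward(prompt, inputs, w_q, w_k, w_v):
    """Prepend the tunable prompt, run the frozen layer, keep the input positions."""
    outputs = attention_layer(prompt + inputs, w_q, w_k, w_v)
    return outputs[len(prompt):]


def prompt_param_count(prompt_len, dim):
    """Tunable parameters in a soft prompt of prompt_len vectors of size dim."""
    return prompt_len * dim


def lora_param_count(rank, dim):
    """Tunable parameters in one update B @ A, with B: dim x rank and A: rank x dim."""
    return 2 * rank * dim


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    w_q, w_k, w_v = (random_matrix(rng, dim, dim) for _ in range(3))
    inputs = random_matrix(rng, 3, dim)   # three input tokens
    prompt = random_matrix(rng, 2, dim)   # two soft-prompt tokens (the only tunable part)
    print(prompt_tuned_forward(prompt, inputs, w_q, w_k, w_v))
    print(prompt_param_count(2, dim), lora_param_count(1, dim))
```

In this sketch the prompt can influence the outputs only through the extra keys and values it contributes to the attention over a fixed layer, whereas a low-rank update would change the weight matrices themselves.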