Recent empirical evidence indicates that transformer-based in-context learning performs better with a prefix language model (prefixLM), in which all in-context samples can attend to each other, than with a causal language model (causalLM), whose auto-regressive attention prohibits in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, the convergence dynamics of causalLM follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows to infinity. We supplement our theoretical claims with empirical experiments on synthetic and real tasks, using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
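To make the difference between the two attention patterns concrete, the following is a minimal sketch of the masks described above. It assumes a simplified layout that is not taken from the paper: each in-context sample occupies one position, and the in-context samples are followed by query positions. The function names `causal_mask`, `prefix_mask`, and `render` are hypothetical and chosen only for illustration.

```python
def causal_mask(n_ctx, n_query=1):
    """Causal (auto-regressive) mask: position i may attend to j only if j <= i.

    mask[i][j] is True when position i is allowed to attend to position j.
    """
    total = n_ctx + n_query
    return [[j <= i for j in range(total)] for i in range(total)]


def prefix_mask(n_ctx, n_query=1):
    """Prefix mask: the first n_ctx positions (the in-context samples) all
    attend to each other; the remaining positions stay causal.
    """
    total = n_ctx + n_query
    return [
        [j <= i or (i < n_ctx and j < n_ctx) for j in range(total)]
        for i in range(total)
    ]


def render(mask):
    """Render a mask as text: '1' = may attend, '0' = blocked."""
    return "\n".join("".join("1" if cell else "0" for cell in row) for row in mask)


if __name__ == "__main__":
    # Three in-context samples followed by one query position.
    print("causalLM mask:")
    print(render(causal_mask(3)))
    print("prefixLM mask:")
    print(render(prefix_mask(3)))
```

Under this layout the query row is identical in both masks. The two differ only in whether earlier in-context samples can see later ones, which is the distinction the analysis is concerned with.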