Recent empirical evidence indicates that transformer-based in-context learning performs better with a prefix language model (prefixLM), in which in-context samples can all attend to each other, than with a causal language model (causalLM), whose auto-regressive attention prohibits in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper, we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, the convergence dynamics of causalLM follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows without bound. We supplement our theoretical claims with empirical experiments on synthetic and real tasks, using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
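To make the distinction between the two attention patterns concrete, the following is a minimal sketch of the masks involved, not the parameter construction analyzed in the paper. The function name `attention_mask` and the parameter `prefix_len` are hypothetical and chosen for illustration; the sketch assumes the in-context samples occupy the first `prefix_len` positions of the sequence.

```python
import numpy as np


def attention_mask(seq_len: int, prefix_len: int = 0) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if position i may attend to position j.

    prefix_len = 0 yields a purely causal (lower-triangular) mask.
    prefix_len > 0 additionally lets the first prefix_len positions
    (assumed here to hold the in-context samples) attend to each other
    in both directions, as in a prefixLM.
    """
    # Causal baseline: each position sees itself and earlier positions only.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Prefix block: full bidirectional attention among in-context samples.
    mask[:prefix_len, :prefix_len] = True
    return mask


# Hypothetical example: 3 in-context samples followed by 1 query position.
causal_mask = attention_mask(4)
prefix_mask = attention_mask(4, prefix_len=3)
```

Under the causal mask, the first in-context sample cannot attend to the second or third; under the prefix mask it can, while the query position still attends to all earlier positions in both cases.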