In-Context Learning (ICL) is a phenomenon in which task learning occurs through a prompt sequence, without any parameter updates. ICL in Multi-Headed Attention (MHA) with absolute positional embeddings has received more study than ICL in other sequence-model varieties. We examine the implications of the architectural differences between GPT-2 and LLaMa, and between LLaMa and Mamba. We extend the work of Garg et al. (2022) and Park et al. (2024) to GPT-2/LLaMa and LLaMa/Mamba hybrid models, examining the interplay between sequence-transformation blocks and in-context regression performance. We note that certain architectural changes degrade training efficiency and ICL accuracy, with models converging to suboptimal predictors or converging more slowly. We also find hybrids that show promising performance improvements, informing potential future ICL-focused architecture modifications. Additionally, we propose the "ICL regression score", a scalar metric summarizing a model's overall performance on a specific task. Compute limitations restrict our architecture space, training duration, number of training runs, function-class complexity, and benchmark complexity. To foster reproducible and extensible research, we provide a typed, modular, and extensible Python package on which we run all experiments.