Large language models (LLMs) show remarkable performance improvements through in-context learning (ICL), which leverages task-specific examples in the input. However, the mechanisms behind this improvement remain elusive. In this work, we investigate embedding and attention representations in Llama-2 70B and Vicuna 13B. Specifically, we study how embeddings and attention change after ICL, and how these changes mediate improvements in behavior. We employ neuroscience-inspired techniques, such as representational similarity analysis (RSA), and propose novel methods for parameterized probing and attention ratio analysis (ARA), which measures the ratio of attention to relevant vs. irrelevant information. We designed three tasks with a priori relationships among their conditions: reading comprehension, linear regression, and adversarial prompt injection. We formed hypotheses about the expected similarities in task representations and used them to investigate latent changes in embeddings and attention. Our analyses revealed a meaningful correlation between changes in both embedding and attention representations and improvements in behavioral performance after ICL. This empirical framework supports a nuanced understanding of how latent representations affect LLM behavior with and without ICL, offering tools and insights for future research and practical applications.
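To make the two analysis ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes that RSA compares a cosine-similarity matrix of prompt embeddings against a hypothesized similarity matrix, and that ARA divides the total attention on relevant tokens by the total attention on irrelevant tokens. The function names, the choice of Pearson correlation, and the toy inputs are all hypothetical; the abstract does not specify these details.

```python
import numpy as np


def similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows (one row per prompt)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T


def rsa_score(embeddings: np.ndarray, hypothesis: np.ndarray) -> float:
    """Correlate observed and hypothesized similarity matrices.

    Assumption: Pearson correlation over the off-diagonal upper triangle;
    other similarity or correlation measures are equally plausible.
    """
    observed = similarity_matrix(embeddings)
    upper = np.triu_indices(observed.shape[0], k=1)
    return float(np.corrcoef(observed[upper], hypothesis[upper])[0, 1])


def attention_ratio(attn: np.ndarray, relevant, irrelevant) -> float:
    """Ratio of attention mass on relevant vs. irrelevant token positions.

    attn is a 1-D vector of attention weights from one query position
    over the prompt tokens (assumed already averaged over heads).
    """
    return float(attn[relevant].sum() / attn[irrelevant].sum())


# Toy data: four prompts from two hypothetical task conditions.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_hypothesis = np.array(
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float
)
toy_attention = np.array([0.1, 0.4, 0.3, 0.2])

if __name__ == "__main__":
    print("RSA score:", rsa_score(toy_embeddings, toy_hypothesis))
    print("Attention ratio:", attention_ratio(toy_attention, [1, 2], [0, 3]))
```

In this sketch, an RSA score near 1 means the embeddings group prompts the way the hypothesis predicts, and an attention ratio above 1 means more attention falls on the relevant tokens. Comparing both quantities with and without in-context examples is one way such changes could be related to behavior.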