Large language models (LLMs) trained on large text corpora demonstrate complex, emergent capabilities, achieving state-of-the-art performance on tasks they were not explicitly trained for. The precise nature of these capabilities is often poorly understood, and different prompts can elicit different capabilities through in-context learning. We propose a Cognitive Interpretability framework that analyzes in-context learning dynamics to understand the latent concepts that underlie LLMs' behavioral patterns. This provides a more nuanced understanding than success-or-failure evaluation benchmarks, but does not require observing internal activations, as a mechanistic interpretation of circuits would. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study the dynamics of in-context learning by manipulating properties of the context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking in-context learning dynamics in which model outputs transition sharply from pseudo-random behavior to deterministic repetition.
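As a rough illustration of the setup described above, the following Python sketch builds binary-sequence contexts of varying length and checks whether a sequence is a pure repetition of a short pattern. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the comma-separated prompt format, and the period-based repetition check are all hypothetical choices, and no model is called.

```python
import random


def make_context(length, p_one=0.5, seed=0):
    """Sample a random binary sequence of the given length (assumed setup)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_one else 0 for _ in range(length)]


def repeat_context(pattern, length):
    """Build a context by repeating a fixed pattern up to `length` items."""
    return [pattern[i % len(pattern)] for i in range(length)]


def format_prompt(bits):
    """Render bits as a comma-separated prompt (hypothetical format)."""
    return ", ".join(str(b) for b in bits) + ","


def shortest_period(bits):
    """Smallest p such that bits[i] == bits[i % p] for every i.

    A small value relative to len(bits) suggests deterministic repetition;
    a value near len(bits) suggests no short repeating pattern.
    """
    n = len(bits)
    for period in range(1, n + 1):
        if all(bits[i] == bits[i % period] for i in range(n)):
            return period
    return n  # only reached for an empty sequence


if __name__ == "__main__":
    # Vary the context length, the property manipulated in the abstract.
    for length in (4, 8, 16):
        context = make_context(length, seed=length)
        print(length, format_prompt(context), shortest_period(context))
```

In an actual experiment, each prompt would be sent to a model and a statistic such as `shortest_period` would be computed on the model's continuation as a function of context length. That step is omitted here because it depends on the model API in use.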