Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.