TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents

Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, verified reasoning trajectories from previously solved cases (e.g., chain-of-thought solutions and their step-level decompositions). At test time, the agent retrieves both relevant skills and experiences from curated libraries and performs lightweight test-time adaptation to align the language model's intermediate reasoning with clinically valid logic. Concretely, we build (i) a skills library from guideline-style documents organized as executable decision rules, (ii) an experience library of exemplar clinical reasoning chains indexed by step-level transitions, and (iii) a step-aware retriever that selects the most useful skill and experience items for the current case. We then adapt the model on the retrieved items to reduce instance-step misalignment and to prevent reasoning from drifting toward unsupported shortcuts. Experiments on medical question-answering benchmarks show consistent gains over strong medical RAG baselines and prompting-only reasoning methods. Our results suggest that explicitly separating and retrieving clinical skills and experience, and then aligning the model at test time, is a practical approach to more reliable medical agents.
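The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Item` record, the step labels, the bag-of-words similarity, and the `step_bonus` weight are all assumptions made for illustration, and the library entries are toy placeholders rather than clinical guidance. The sketch shows only the idea of keeping skills and experience in separate pools and preferring items that match the current reasoning step. The test-time adaptation stage is not shown.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    kind: str  # "skill" or "experience"
    step: str  # hypothetical reasoning-step label, e.g. "diagnosis"
    text: str  # rule text or solved-case reasoning snippet


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, step: str, library: list, k: int = 1,
             step_bonus: float = 0.5) -> dict:
    """Return the top-k skills and top-k experiences for the current step.

    Items whose step label matches the current step get an additive bonus;
    this is an assumed, simplistic form of step awareness.
    """
    q = embed(query)

    def score(item: Item) -> float:
        bonus = step_bonus if item.step == step else 0.0
        return cosine(q, embed(item.text)) + bonus

    result = {}
    for kind in ("skill", "experience"):
        pool = [item for item in library if item.kind == kind]
        result[kind] = sorted(pool, key=score, reverse=True)[:k]
    return result


# Toy library; entries are illustrative placeholders only.
LIBRARY = [
    Item("skill", "treatment",
         "If sepsis is suspected, start broad-spectrum antibiotics early."),
    Item("skill", "diagnosis",
         "If chest pain with ST elevation, suspect myocardial infarction."),
    Item("experience", "diagnosis",
         "Patient with chest pain and ST elevation: ECG pointed to myocardial infarction."),
    Item("experience", "treatment",
         "Patient with fever and hypotension: started antibiotics for suspected sepsis."),
]

if __name__ == "__main__":
    hits = retrieve("chest pain with ST elevation on ECG", "diagnosis", LIBRARY)
    for kind, items in hits.items():
        print(kind, "->", items[0].text)
```

In a fuller system the toy `embed` function would presumably be replaced by a learned encoder and the additive step bonus by a trained step-aware scorer; both choices here are placeholders.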
