LLM agents that store knowledge as natural language suffer steep retrieval degradation as condition count grows, often struggle to compose learned rules reliably, and typically lack explicit mechanisms to detect stale or adversarial knowledge. We introduce PRECEPT, a unified framework for test-time adaptation with three tightly coupled components: (1) deterministic exact-match rule retrieval over structured condition keys, (2) conflict-aware memory with Bayesian source reliability and threshold-based rule invalidation, and (3) COMPASS, a Pareto-guided prompt-evolution outer loop. Exact retrieval eliminates partial-match interpretation errors on the deterministic path (0% by construction, versus 94.4% under Theorem~B.6's independence model at N=10) and supports compositional stacking through a semantic tier hierarchy; conflict-aware memory resolves static--dynamic disagreements and supports drift adaptation; COMPASS evaluates prompts through the same end-to-end execution pipeline. Results (9--10 seeds): PRECEPT achieves a +41.1pp first-try advantage over Full Reflexion (d>1.9), +33.3pp compositional generalization (d=1.55), 100% $P_1$ on 2-way logistics compositions (d=2.64), +40--55pp continuous learning gains, strong eventual robustness under adversarial static knowledge (100% on logistics with adversarial static knowledge active; partial recovery on integration), +55.0pp drift recovery (d=0.95, p=0.031), and 61% fewer steps. Core comparisons are statistically significant, often at p<0.001.
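To make the first component concrete, the following is a minimal sketch of exact-match rule retrieval over structured condition keys. It is an illustration under stated assumptions, not PRECEPT's actual implementation: the names `condition_key` and `ExactRuleStore`, the condition codes, and the action strings are all hypothetical. The point it shows is that a lookup either matches the full condition set or returns nothing, so a partially matching rule is never returned.

```python
from typing import Dict, FrozenSet, Iterable, Optional


def condition_key(conditions: Iterable[str]) -> FrozenSet[str]:
    """Canonicalise condition codes into an order-independent key.

    Hypothetical normalisation: strip whitespace and upper-case each code.
    """
    return frozenset(c.strip().upper() for c in conditions)


class ExactRuleStore:
    """Hypothetical rule store keyed by the full set of active conditions."""

    def __init__(self) -> None:
        self._rules: Dict[FrozenSet[str], str] = {}

    def learn(self, conditions: Iterable[str], action: str) -> None:
        # Overwrites any rule previously stored under the same key.
        self._rules[condition_key(conditions)] = action

    def lookup(self, conditions: Iterable[str]) -> Optional[str]:
        # Hash lookup on the whole key: a subset or superset of a stored
        # condition set does not match, so there is no partial-match path.
        return self._rules.get(condition_key(conditions))


store = ExactRuleStore()
store.learn(["port_closed", "customs_hold"], "reroute_via_B")

full_match = store.lookup(["CUSTOMS_HOLD", " port_closed "])
partial_match = store.lookup(["port_closed"])
```

Here `full_match` is the stored action regardless of condition order or casing, while `partial_match` is `None`. In this sketch a miss is simply reported to the caller, and what happens next (for example, falling back to another retrieval tier) is outside its scope.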