LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming-lab/assay.
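The randomized-masking measurement described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the ASSAY implementation: the skill names, task types, success probabilities, and the helper functions `masked_effects`, `select_skills`, and `simulate` are all hypothetical, the estimator is a plain difference in means, and task type stands in for whatever per-task predictor the framework actually uses.

```python
import random
from collections import defaultdict


def masked_effects(trials, skills):
    """Difference-in-means effect of each skill, per task type.

    trials: iterable of (task_type, active_skills, success), where
    active_skills is the set of skills left unmasked in that run and
    success is 0 or 1.
    """
    # sums[(skill, task_type)][arm] = [total_success, run_count]
    # arm 1 = skill active, arm 0 = skill masked
    sums = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for task_type, active, success in trials:
        for skill in skills:
            cell = sums[(skill, task_type)][int(skill in active)]
            cell[0] += success
            cell[1] += 1
    effects = {}
    for key, arms in sums.items():
        (s0, n0), (s1, n1) = arms[0], arms[1]
        if n0 and n1:  # need runs in both arms to compare
            effects[key] = s1 / n1 - s0 / n0
    return effects


def select_skills(skills, task_type, effects):
    """Keep skills whose estimated effect on this task type is not negative."""
    return [s for s in skills if effects.get((s, task_type), 0.0) >= 0.0]


def simulate(n_runs=4000, seed=0):
    """Toy environment (hypothetical numbers): one skill helps on 'api'
    tasks and hurts on 'billing' tasks; the other helps everywhere."""
    rng = random.Random(seed)
    skills = ["retry_on_error", "confirm_first"]
    trials = []
    for i in range(n_runs):
        task_type = "api" if i % 2 == 0 else "billing"
        # Randomized masking: each skill is independently kept with prob. 0.5.
        active = {s for s in skills if rng.random() < 0.5}
        p = 0.4
        if "retry_on_error" in active:
            p += 0.3 if task_type == "api" else -0.3
        if "confirm_first" in active:
            p += 0.2
        trials.append((task_type, active, int(rng.random() < p)))
    return skills, trials


skills, trials = simulate()
per_type = masked_effects(trials, skills)
# Pooling all task types hides the opposing effects of retry_on_error.
pooled = masked_effects([("all", a, y) for _, a, y in trials], skills)
```

In this toy setup the pooled estimate for `retry_on_error` sits near zero while its per-type estimates have opposite signs, so a global keep-or-drop decision has nothing to act on. `select_skills(skills, "billing", per_type)` masks that skill for billing tasks only, and it stays available for API tasks.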