Multimodal agents can now tackle complex reasoning tasks with diverse tools, yet they still suffer from inefficient tool use and inflexible orchestration in open-ended settings. A central challenge is enabling such agents to continually improve without parameter updates by learning from past trajectories. We identify two complementary forms of reusable knowledge essential for this goal: experiences, which provide concise action-level guidance for tool selection and decision making, and skills, which provide structured task-level guidance for planning and tool use. To this end, we propose XSkill, a dual-stream framework for continual learning from experiences and skills in multimodal agents. XSkill grounds both knowledge extraction and retrieval in visual observations. During accumulation, XSkill distills and consolidates experiences and skills from multi-path rollouts via visually grounded summarization and cross-rollout critique. During inference, it retrieves and adapts this knowledge to the current visual context and feeds usage history back into accumulation, forming a continual learning loop. Evaluated on five benchmarks across diverse domains with four backbone models, XSkill consistently and substantially outperforms both tool-only and learning-based baselines. Further analysis shows that the two knowledge streams play complementary roles in shaping agents' reasoning behaviors and exhibit strong zero-shot generalization.