Large Language Model (LLM)-based web agents excel at knowledge-intensive tasks but face a fundamental conflict between the need for extensive exploration and the constraints of limited context windows. Current solutions typically rely on architectural modifications, such as internal memory tokens, which break compatibility with existing agents and require costly end-to-end retraining. To overcome these limitations, we introduce ReSum, a lightweight, plug-and-play paradigm that enables unbounded exploration by periodically invoking an external tool to condense the interaction history into a compact summary. Although this paradigm works without training, standard agents are not inherently aligned to reason over such compressed contexts. To bridge this gap, we propose ReSum-GRPO, which adapts Group Relative Policy Optimization (GRPO) via advantage broadcasting to propagate the final reward across segmented trajectories, enabling credit assignment over long horizons. Extensive experiments show that ReSum achieves a 4.5% improvement over ReAct in training-free settings, with ReSum-GRPO yielding a further 8.2% gain. Notably, with only 1K training samples, a ReSum-enhanced 30B agent performs competitively with leading open-source models, demonstrating ReSum's effectiveness.
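The periodic summarization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `resum_rollout`, `agent_step`, `summarize`, `is_final`, and `count_tokens` are hypothetical, the token-budget trigger is an assumption (the abstract says only that the summary tool is invoked periodically), and whitespace word counting stands in for a real tokenizer.

```python
from typing import Callable, List


def count_tokens(history: List[str]) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return sum(len(message.split()) for message in history)


def resum_rollout(
    question: str,
    agent_step: Callable[[List[str]], str],  # hypothetical: policy call plus tool observation
    summarize: Callable[[List[str]], str],   # hypothetical: external summary tool
    is_final: Callable[[str], bool],         # hypothetical: detects a final answer
    token_budget: int,
    max_steps: int = 50,
) -> List[List[str]]:
    """Run one rollout and return its segments.

    A new segment starts whenever the history is compressed into a summary,
    so the working context stays bounded while exploration continues.
    """
    history = [question]
    segments: List[List[str]] = []
    for _ in range(max_steps):
        step = agent_step(history)
        history.append(step)
        if is_final(step):
            break
        # Assumed trigger: compress once the context exceeds the budget.
        if count_tokens(history) > token_budget:
            segments.append(history)
            summary = summarize(history)
            # Restart from the question plus the compact summary.
            history = [question, summary]
    segments.append(history)
    return segments
```

Because the loop only wraps calls to the agent and the summary tool, a sketch like this needs no change to the agent itself, which is the sense in which the paradigm is plug-and-play.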
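Advantage broadcasting can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function name `broadcast_advantages` is hypothetical, each rollout is assumed to receive a single scalar final reward, and the group normalization uses the population standard deviation with a small `eps`, a choice that varies between GRPO implementations.

```python
from statistics import mean, pstdev
from typing import List


def broadcast_advantages(
    rewards: List[float],       # one final reward per rollout in the group
    segment_counts: List[int],  # number of segments each rollout was split into
    eps: float = 1e-6,
) -> List[List[float]]:
    """Compute group-relative advantages and copy each to all of its segments."""
    if len(rewards) != len(segment_counts):
        raise ValueError("rewards and segment_counts must have the same length")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # GRPO-style normalization of trajectory-level rewards within the group.
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Broadcasting: every segment of a rollout shares that rollout's advantage.
    return [[a] * n for a, n in zip(advantages, segment_counts)]
```

For a group of four rollouts with rewards `[1.0, 0.0, 0.0, 1.0]` split into `[3, 1, 2, 1]` segments, all three segments of the first rollout receive the same positive advantage, so early segments are credited for a final answer they only contributed to indirectly.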