Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified across five diverse model configurations. Moreover, we show initial evidence of our findings generalizing to the OpenHands agent scaffold. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. Additionally, we introduce a novel hybrid approach that further reduces costs by 7% and 11% compared to just observation masking or LLM summarization, respectively. Our findings raise concerns regarding the trend towards pure LLM summarization and provide initial evidence of untapped cost reductions by pushing the efficiency-effectiveness frontier. We release code and data for reproducibility.
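To make the masking idea concrete, here is a minimal sketch of what an environment observation masking strategy could look like: older tool/environment outputs in the agent's message history are replaced with a short placeholder, while the most recent ones are kept verbatim. The message format, role names, and the "keep last 2" window are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of observation masking (assumed message schema,
# not the paper's actual implementation).

PLACEHOLDER = "[Old environment output omitted to save context.]"

def mask_old_observations(history, keep_last=2):
    """Replace all but the last `keep_last` observation messages with a
    placeholder, leaving assistant/user messages untouched."""
    obs_indices = [i for i, m in enumerate(history)
                   if m["role"] == "observation"]
    # Indices of observations that fall outside the recency window.
    to_mask = set(obs_indices[:-keep_last]) if keep_last else set(obs_indices)
    return [{**m, "content": PLACEHOLDER} if i in to_mask else m
            for i, m in enumerate(history)]
```

Because the placeholder is a constant string, the token cost of each masked turn becomes essentially fixed, which is what allows this strategy to bound context growth without any additional LLM calls, unlike summarization.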