Language model agents (LMAs) have recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite this promise, their performance in real-world applications, which often involve combinations of tasks, is still underexplored. In this work, we introduce a new benchmark, CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. In contrast, transferred LMAs (finetuned only on base tasks) exhibit a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing the data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance (61.5%) on CompWoB. While these results highlight the promise of small-scale finetuned and transferred models for task compositionality, their performance further degrades under instruction compositions that change the order of combination. In contrast to the recent remarkable success of LMAs, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment.