Language model agents (LMAs) have recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite this promise, their performance on real-world applications, which often involve combinations of tasks, is still underexplored. In this work, we introduce a new benchmark, CompWoB, consisting of 50 new compositional web automation tasks that reflect more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing the data distribution across tasks, we train a new model, HTML-T5++, which surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.5%). While these results highlight the promise of small-scale finetuned and transferred models for compositional generalization, their performance degrades further when the instruction composition changes the order in which tasks are combined. In contrast to the recent remarkable success of LMAs, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment.
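To make the reported drops concrete, the sketch below recomputes the gap between base-task and compositional-task success rates from the figures quoted above. It reads "generalization gap" as a simple difference in percentage points, which is an assumption; the paper may define it differently. The names `Result`, `generalization_gap`, and `relative_drop` are hypothetical and are not taken from the CompWoB code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Success rates (%) for one family of agents, as quoted in the abstract."""
    name: str
    base: float  # average success rate on base tasks
    comp: float  # average success rate on compositional tasks


def generalization_gap(r: Result) -> float:
    """Absolute drop in percentage points (assumed reading of 'gap')."""
    return round(r.base - r.comp, 1)


def relative_drop(r: Result) -> float:
    """Drop expressed as a percentage of the base-task success rate."""
    return round(100.0 * (r.base - r.comp) / r.base, 1)


# Figures taken directly from the abstract.
RESULTS = [
    Result("prompted LMAs", 94.0, 24.9),
    Result("transferred LMAs", 85.4, 54.8),
]

if __name__ == "__main__":
    for r in RESULTS:
        print(
            f"{r.name}: gap = {generalization_gap(r)} points, "
            f"relative drop = {relative_drop(r)}%"
        )
```

Under this reading, the prompted agents lose roughly twice as many percentage points as the transferred ones, which is the comparison the abstract draws.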