Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts or entering student grades from exam papers. These tasks are tedious for humans because their length grows in proportion to the amount of data to process. They are, however, well suited to Computer-Use Agents (CUAs): their structured, recurring sub-workflows follow logic that can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, a benchmark of 242 long-horizon, repetitive tasks across 2 domains for evaluating state-of-the-art (SOTA) agents. We then introduce a cost-effective method that constructs a condensed demonstration from only a few examples, teaching agents the underlying workflow logic so they can execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent difficulty of these tasks and the effectiveness of our proposed method. Project website: https://os-marathon.github.io/.