Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts or entering student grades from exam papers. These tasks are often tedious for humans, since their duration grows in proportion to the amount of data to be processed. They are, however, well suited to Computer-Use Agents (CUAs), because they consist of structured, recurring sub-workflows whose logic can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains to evaluate state-of-the-art (SOTA) agents. We then introduce a cost-effective method that constructs a condensed demonstration from only a few examples to teach agents the underlying workflow logic, enabling them to execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent challenges of these tasks and the effectiveness of our proposed method. Project website: https://os-marathon.github.io/.