Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude, yet they remain unreliable on complex, low-frequency interactions, which limits user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations: a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address it, we propose CUActSpot, a new benchmark that evaluates models on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
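The renderer-based synthesis idea (generate a scene, record element coordinates, then pair them with an instruction and an action trace) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the released pipeline: `Element`, `render_scene`, `make_drag_sample`, and `contains` are hypothetical names, the "renderer" only lays out rectangles rather than producing a screenshot, and a fixed string template stands in for the LLM that writes instructions.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    """A rendered widget with its pixel bounding box (left, top, right, bottom)."""
    name: str
    box: tuple

    @property
    def center(self):
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def render_scene(rng, width=800, height=600, n=3):
    """Hypothetical stand-in for a renderer.

    Places n rectangles, one per vertical slot, so they never overlap,
    and returns the ground-truth coordinates a real renderer would record.
    """
    slot = width // n
    elements = []
    for i in range(n):
        w = rng.randint(40, slot - 20)
        h = rng.randint(20, 60)
        left = i * slot + rng.randint(0, slot - w)
        top = rng.randint(0, height - h)
        elements.append(Element(f"item_{i}", (left, top, left + w, top + h)))
    return elements


def make_drag_sample(rng, elements):
    """Pair an instruction with an action trace derived from recorded coordinates.

    The instruction is a fixed template here; in the described pipeline an LLM
    would write it.
    """
    src, dst = rng.sample(elements, 2)
    return {
        "instruction": f"Drag {src.name} onto {dst.name}.",
        "source": src.name,
        "target": dst.name,
        "trace": [{"action": "drag", "start": src.center, "end": dst.center}],
    }


def contains(box, point):
    """Return True if point (x, y) lies inside box (left, top, right, bottom)."""
    left, top, right, bottom = box
    x, y = point
    return left <= x < right and top <= y < bottom
```

Because the coordinates come from the layout step itself, every generated trace can be checked against the recorded boxes with `contains`, which is the property that makes synthetic labels of this kind cheap to verify.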