Computer Use Agents (CUAs) rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions. Editing-level grounding in coding interfaces, where sub-pixel accuracy is required to interact with densely packed IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which has no mechanism for error correction and often fails in high-density interfaces. In this technical report, we present an empirical study of pixel-precise cursor localization in coding environments. Instead of executing a single step, our agent refines its prediction iteratively, using visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to correct its own displacement errors and adapt to dynamic UI changes. We evaluate our approach with GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks and find that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component of the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
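To make the closed-loop idea concrete, the following is a minimal sketch of iterative cursor refinement, not the implementation used in this report. The names `Box`, `refine_cursor`, and `make_toy_proposer` are hypothetical, and the "model" is a toy stand-in that halves its remaining displacement on each turn; a real agent would instead query a vision-language model with a fresh screenshot showing the previous cursor position.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in screen pixels (right/bottom exclusive)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, p: Point) -> bool:
        x, y = p
        return self.left <= x < self.right and self.top <= y < self.bottom


def refine_cursor(
    propose: Callable[[Optional[Point], List[Point]], Point],
    on_target: Callable[[Point], bool],
    max_turns: int = 5,
) -> Tuple[Optional[Point], List[Point]]:
    """Closed-loop grounding: propose a point, observe the result, correct, repeat.

    `propose(last, history)` is a hypothetical placeholder for a model call that
    sees the previous attempt; `on_target` is a placeholder for the visual check.
    Returns the accepted point (or None if the turn budget runs out) and the
    full list of attempts.
    """
    history: List[Point] = []
    last: Optional[Point] = None
    for _ in range(max_turns):
        last = propose(last, history)
        history.append(last)
        if on_target(last):
            return last, history
    return None, history


def make_toy_proposer(first_guess: Point, target_center: Point):
    """Toy stand-in for a model: each turn moves halfway toward the target center."""
    def propose(last: Optional[Point], history: List[Point]) -> Point:
        if last is None:
            return first_guess
        x, y = last
        tx, ty = target_center
        return (x + (tx - x) // 2, y + (ty - y) // 2)
    return propose


# Assumed example: an 8x8 px icon and a first guess that misses it.
target = Box(left=100, top=50, right=108, bottom=58)
proposer = make_toy_proposer(first_guess=(120, 40), target_center=(104, 54))

single_shot, _ = refine_cursor(proposer, target.contains, max_turns=1)
multi_turn, attempts = refine_cursor(proposer, target.contains, max_turns=5)

print(single_shot)  # None: the single-shot guess misses the icon
print(multi_turn)   # (106, 52): reached on the fourth attempt
print(attempts)     # [(120, 40), (112, 47), (108, 50), (106, 52)]
```

With a budget of one turn the loop reduces to single-shot prediction and misses the target, while the same proposer succeeds once it is allowed to observe and correct its own displacement. The halving rule is only an illustrative assumption; the report does not specify how corrections are computed.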