FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, define target states imprecisely, and rely too heavily on final-task success, which obscures where and why agents fail. To address this gap, we introduce \textbf{FineState-Bench}, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types; each instance explicitly specifies an exact target state for fine-grained state setting. We further propose \textit{FineState-Metrics}, a four-stage diagnostic pipeline that reports stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int). We also propose a plug-and-play \textit{Visual Diagnostic Assistant} (VDA) that generates a Description and a bounding-box Localization Hint, so that controlled comparisons with and without VDA can reveal how much of the failure is attributable to visual grounding. On FineState-Bench, exact goal-state success remains low: the best ES-SR@Int is 32.8\% on Web and 22.8\% averaged across platforms. With VDA localization hints, Gemini-2.5-Flash gains 14.9 points of ES-SR@Int, suggesting substantial headroom from improved visual grounding, yet overall accuracy is still insufficient for reliable fine-grained, state-conditioned interaction. FineState-Bench is available on \href{https://github.com/FengxianJi/FineState-Bench}{GitHub}.
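
To make the stage-wise reporting concrete, the following is a minimal sketch of how the four rates and a with/without-VDA gain could be aggregated from per-instance outcomes. It is not the benchmark's implementation: the `StageOutcome` record, its field names, and the reading of ES-SR@Loc and ES-SR@Int as "stage passed and exact target state reached" are assumptions made for illustration, and the precise definitions in FineState-Metrics may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class StageOutcome:
    """Hypothetical per-instance record; field names are illustrative only."""
    located: bool      # assumed: the prediction hit the intended UI control
    interacted: bool   # assumed: the prediction hit the control's interactable region
    exact_state: bool  # assumed: the resulting state equals the specified target state


def stage_rates(outcomes: Iterable[StageOutcome]) -> Dict[str, float]:
    """Return the four stage-wise success rates as percentages."""
    rows = list(outcomes)
    if not rows:
        raise ValueError("at least one outcome is required")
    n = len(rows)

    def pct(count: int) -> float:
        return 100.0 * count / n

    return {
        "SR@Loc": pct(sum(r.located for r in rows)),
        "SR@Int": pct(sum(r.interacted for r in rows)),
        # Assumed reading: the stage is passed AND the exact state is reached.
        "ES-SR@Loc": pct(sum(r.located and r.exact_state for r in rows)),
        "ES-SR@Int": pct(sum(r.interacted and r.exact_state for r in rows)),
    }


def gain_in_points(with_hint: Iterable[StageOutcome],
                   without_hint: Iterable[StageOutcome],
                   metric: str = "ES-SR@Int") -> float:
    """Percentage-point difference between the with-hint and without-hint runs."""
    return stage_rates(with_hint)[metric] - stage_rates(without_hint)[metric]
```

Reporting all four rates side by side, rather than only the last one, is what lets a drop between adjacent stages indicate where an agent fails; the gain function mirrors the controlled comparison with and without VDA hints on the same instances.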
