User experience (UX), centered on usability, perceived consistency, and functional clarity, is fundamental to real-world user interfaces (UIs). Multimodal large language models (MLLMs) are rapidly being applied to user interfaces, in tasks such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation. However, research on evaluating UX from UI screenshots remains immature. To address this, we propose UXBench, a novel multimodal benchmark of 2,000 visual question answering (VQA) samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench comprises 8 tasks built on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in UI-based reasoning, underscoring the need for further advances in this area. To bridge this gap, we propose UI-UX, an MLLM built on the Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963, surpassing Claude-4.5-Sonnet's 0.6550, while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.
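To put the headline comparison in concrete terms, the short sketch below computes the absolute and relative gap between the two accuracies reported in the abstract. Only the two numbers come from the text; the function name `accuracy_gap` and the choice to report a relative gain are illustrative assumptions, not part of the paper's evaluation protocol.

```python
def accuracy_gap(model_acc: float, baseline_acc: float) -> tuple[float, float]:
    """Return (absolute gap, relative gain over the baseline).

    Hypothetical helper for illustration; not taken from the paper.
    """
    absolute = model_acc - baseline_acc
    relative = absolute / baseline_acc
    return absolute, relative


# Accuracies on UXBench as reported in the abstract.
ui_ux_acc = 0.7963
claude_acc = 0.6550

absolute, relative = accuracy_gap(ui_ux_acc, claude_acc)
print(f"absolute gap:  {absolute:.4f}")   # 0.1413
print(f"relative gain: {relative:.1%}")   # 21.6%
```

Under these figures, UI-UX leads by roughly 14 accuracy points, or about a fifth more correct answers than the baseline.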