Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first, runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.
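The two mechanisms named in the abstract, coverage gating and repair lift, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the gating rule, the scoring scale, or the lift formula, so the function names, the `threshold` parameter, and the "after minus before" definition of lift are hypothetical and may differ from what UXBench actually does.

```python
from statistics import mean


def coverage_gate(visited: set[str], required: set[str], threshold: float = 0.8) -> bool:
    """Hypothetical gate: allow a report only once the judge has interacted
    with at least `threshold` of the required elements (assumed rule)."""
    if not required:
        return True
    return len(visited & required) / len(required) >= threshold


def repair_lift(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Assumed definition of lift: per-dimension score after the repair
    agent acts on the critique, minus the score before."""
    return {dim: after[dim] - before[dim] for dim in before}


def mean_lift(lifts: list[dict[str, float]]) -> float:
    """Average lift over all rubric dimensions and all fixtures."""
    return mean(value for lift in lifts for value in lift.values())
```

Under these assumptions, a judge whose critiques lead the fixed repair agent to larger score gains receives a higher mean lift, and keeping the per-dimension values separate is what would expose rubric-level differences between judges.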