User interface (UI) design goes beyond visuals to shape user experience (UX), which underscores the shift toward treating UI/UX as a unified concept. Recent studies have explored UI evaluation with Multimodal Large Language Models (MLLMs), but they largely focus on surface-level features and overlook how design choices influence user behavior at scale. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for multimodal understanding of how UI/UX design affects user behavior. It is built on 300 real-world UI image pairs from industry A/B tests, with empirically validated winners that induced more user actions. Advancing design practice also requires a post-hoc understanding of why such winners succeed across large user populations; we support this with expert-curated key interpretations for each instance. We evaluate multiple MLLMs on WiserUI-Bench on two main tasks: (1) predicting the more effective UI image in an A/B-tested pair, and (2) explaining that outcome post hoc in alignment with expert interpretations. The results show that models exhibit limited understanding of the behavioral impact of UI/UX design. We believe our work will foster research on leveraging MLLMs for visual design in user behavior contexts.