Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential for supporting computer-use planning tasks, which are closely tied to everyday life, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs possess the capabilities needed to complete these tasks. We therefore propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To this end, we focus on sub-tasks that arise frequently in daily life and require planning steps. Specifically, we design three new sub-tasks: route planning, work diagramming, and web&UI displaying. To ensure data quality, we curate human-annotated questions and reference images and apply a quality control process. To make the evaluation comprehensive and precise, we propose a task-adaptive score, PlanScore, which helps assess the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.