Enhancing Generative AI Image Refinement with Scribbles and Annotations: A Comparative Study of Multimodal Prompts

Generative AI (GenAI) image tools are increasingly used in design practice. They enable rapid ideation but offer limited support for refinement tasks such as adjusting layout, scale, or visual attributes. Text prompts and inpainting allow localized edits, yet they are often inefficient or ambiguous for precise, in-context, and iterative refinement, which motivates the exploration of alternative methods. This work examines how pen-based scribbles and annotations can enhance GenAI image refinement. A formative study with seven professional designers informed a prototype supporting three input modalities: text-only, visual-only, and combined prompting. A within-subjects study with 30 designers and design students compared these modalities across closed- and open-ended tasks, evaluating expressiveness, efficiency, workload, user experience, iteration, and multimodal strategies. Visual prompts improved clarity and speed for spatial edits while reducing workload, whereas text remained effective for semantic and global changes. The combined modality received the highest overall ratings: it enabled complementary use, balanced spatial precision with semantic detail, and supported smoother iteration. Task-specific preferences also emerged: adding new objects often required both modalities, while moving or modifying elements was typically handled through visual input. This work contributes (1) an empirical comparison of multimodal prompting for GenAI refinement, (2) a prototype integrating scribbles and annotations, and (3) insights into designers' multimodal strategies to inform future GenAI interfaces that better support refinement in design workflows.
