Blind and low vision (BLV) creators use images to communicate with sighted audiences. However, creating or retrieving images is challenging for BLV creators because authoring tools are difficult to use and image search results are difficult to assess. As a result, creators limit the types of images they create or recruit sighted collaborators. While text-to-image generation models let creators generate high-fidelity images from a text description (i.e., a prompt), it remains difficult for BLV creators to assess the content and quality of the generated images. We present GenAssist, a system that makes text-to-image generation accessible. Using our interface, creators can verify whether generated image candidates follow the prompt, access additional details in the image that the prompt did not specify, and skim a summary of similarities and differences between image candidates. To power the interface, GenAssist uses a large language model to generate visual questions, vision-language models to extract answers, and a large language model to summarize the results. Our study with 12 BLV creators demonstrated that GenAssist enables and simplifies the process of image selection and generation, making visual authoring more accessible to all.
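The three-stage pipeline named in the abstract (question generation, answer extraction, summarization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `describe_candidates`, the type alias `AnswerTable`, and the stub stages with their canned answers are all hypothetical and stand in for the large language model and vision-language models, which are not specified here. The sketch shows only how the data might flow between stages, assuming each stage is a plain callable.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical structure: image id -> {visual question: answer}
AnswerTable = Dict[str, Dict[str, str]]


def describe_candidates(
    prompt: str,
    image_ids: List[str],
    generate_questions: Callable[[str], List[str]],
    extract_answer: Callable[[str, str], str],
    summarize: Callable[[AnswerTable], str],
) -> Tuple[AnswerTable, str]:
    """Run the three stages: questions -> per-image answers -> summary."""
    questions = generate_questions(prompt)
    table = {
        image_id: {q: extract_answer(image_id, q) for q in questions}
        for image_id in image_ids
    }
    return table, summarize(table)


# --- Stand-in stages (assumptions; real models would replace these) ---

def stub_questions(prompt: str) -> List[str]:
    # Stands in for an LLM that turns the prompt into visual questions.
    return [f"Does the image show {prompt}?", "What is in the background?"]


CANNED_ANSWERS = {
    ("img_a", "Does the image show a red bicycle?"): "yes",
    ("img_b", "Does the image show a red bicycle?"): "yes",
    ("img_a", "What is in the background?"): "a brick wall",
    ("img_b", "What is in the background?"): "a beach",
}


def stub_answer(image_id: str, question: str) -> str:
    # Stands in for a vision-language model answering one question per image.
    return CANNED_ANSWERS[(image_id, question)]


def stub_summary(table: AnswerTable) -> str:
    # Stands in for an LLM summary: counts questions whose answers
    # agree across all candidates versus those that differ.
    questions = list(next(iter(table.values())))
    shared = [q for q in questions if len({row[q] for row in table.values()}) == 1]
    differing = [q for q in questions if q not in shared]
    return f"{len(shared)} shared answer(s), {len(differing)} differing answer(s)"


table, summary = describe_candidates(
    "a red bicycle", ["img_a", "img_b"], stub_questions, stub_answer, stub_summary
)
print(summary)
```

Keeping each stage as an injected callable is a design choice of this sketch; it lets the per-image answer table serve both prompt verification and the cross-candidate comparison without tying the example to any particular model API.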