Prompting and steering techniques are well established in general-purpose generative AI, yet assistive visual question answering (VQA) tools for blind users still follow rigid interaction patterns with limited opportunities for customization. User control can be helpful when system responses are misaligned with users' goals and contexts, a gap that becomes especially consequential for blind users who may rely on these systems for access. We invited 11 blind users to customize their interactions with a real-world conversational VQA system. Drawing on 418 interactions, reflections, and post-study interviews, we analyze the prompting-based techniques participants adopted, including those introduced in the study and those developed independently in real-world settings. VQA interactions were often lengthy: participants averaged 3 turns per conversation, sometimes up to 21, with input text typically one-tenth the length of the responses they heard. Although built on state-of-the-art LLMs, the system lacked verbosity controls, was limited in estimating distance in space and time, relied on inaccessible image framing, and offered little to no camera guidance. We discuss how customization techniques such as prompt engineering can help participants work around these limitations. Alongside a new publicly available dataset, we offer insights for interaction design at both the query and system levels.