Does human-AI assistance unfold in the same way as human-human assistance? This research explores what can be learned from the expertise of blind individuals and sighted volunteers to inform the design of multimodal voice agents and to address the enduring challenge of proactivity. Drawing on a granular analysis of two representative fragments from a larger corpus, we contrast the practices co-produced by an experienced human remote sighted assistant and a blind participant, as they collaborate over the phone to find a stain on a blanket, with those achieved when the same participant had worked with a multimodal voice agent on the same task a few moments earlier. This comparison enables us to specify precisely which fundamental proactive practices the agent did not enact in situ. We conclude that, so long as multimodal voice agents cannot produce environmentally occasioned vision-based actions, they will lack a key resource relied upon by human remote sighted assistants.