Field studies are irreplaceable but costly, time-consuming, and error-prone, and they require careful preparation. Inspired by rapid prototyping in manufacturing, we propose a fast, low-cost evaluation method that uses Vision-Language Model (VLM) personas to simulate outcomes comparable to field results. While LLMs show human-like reasoning and language capabilities, autonomous vehicle (AV)-pedestrian interaction requires spatial awareness, emotional empathy, and behavioral generation. This raises our research question: To what extent can VLM personas mimic human responses in field studies? We conducted two parallel studies: 1) a real-world study with 20 participants, and 2) a video study using 20 VLM personas, both on a street-crossing task. We compared their responses and interviewed five HCI researchers about potential applications. Results show that VLM personas mimic human response patterns (e.g., average crossing times of 5.25 s vs. 5.07 s) but lack the behavioral variability and depth of human participants. They show promise for formative studies, field study preparation, and human data augmentation.