Generative AI (GenAI) systems are inherently non-deterministic, producing varied outputs even for identical inputs. While this variability is central to their appeal, it challenges established HCI evaluation practices that typically assume consistent and predictable system behavior. Designing controlled lab studies under such conditions therefore remains a key methodological challenge. We present a reflective multi-case analysis of four lab-based user studies with GenAI-integrated prototypes, spanning conversational in-car assistant systems and image generation tools for design workflows. Through cross-case reflection and thematic analysis across all study phases, we identify five methodological challenges and propose eighteen practice-oriented recommendations, organized into five guidelines. These challenges represent methodological constructs that are either amplified, redefined, or newly introduced by GenAI's stochastic nature: (C1) reliance on familiar interaction patterns, (C2) fidelity-control trade-offs, (C3) feedback and trust, (C4) gaps in usability evaluation, and (C5) interpretive ambiguity between interface and system issues. Our guidelines address these challenges through strategies such as reframing onboarding to help participants manage unpredictability, extending evaluation with constructs such as trust and intent alignment, and logging system events, including hallucinations and latency, to support transparent analysis. This work contributes (1) a methodological reflection on how GenAI's stochastic nature unsettles lab-based HCI evaluation and (2) eighteen recommendations that help researchers design more transparent, robust, and comparable studies of GenAI systems in controlled settings.