Generative AI (GenAI) systems are inherently non-deterministic, producing varied outputs even for identical inputs. While this variability is central to their appeal, it challenges established HCI evaluation practices that typically assume consistent and predictable system behavior. Designing controlled lab studies under such conditions therefore remains a key methodological challenge. We present a reflective multi-case analysis of four lab-based user studies with GenAI-integrated prototypes, spanning conversational in-car assistant systems and image generation tools for design workflows. Through cross-case reflection and thematic analysis across all study phases, we identify five methodological challenges and propose eighteen practice-oriented recommendations, organized into five guidelines. These challenges represent methodological constructs that are either amplified, redefined, or newly introduced by GenAI's stochastic nature: (C1) reliance on familiar interaction patterns, (C2) fidelity-control trade-offs, (C3) feedback and trust, (C4) gaps in usability evaluation, and (C5) interpretive ambiguity between interface and system issues. Our guidelines address these challenges through strategies such as reframing onboarding to help participants manage unpredictability, extending evaluation with constructs such as trust and intent alignment, and logging system events, including hallucinations and latency, to support transparent analysis. This work contributes (1) a methodological reflection on how GenAI's stochastic nature unsettles lab-based HCI evaluation and (2) eighteen recommendations that help researchers design more transparent, robust, and comparable studies of GenAI systems in controlled settings.