Evaluating Generative AI in the Lab: Methodological Challenges and Guidelines

Generative AI (GenAI) systems are inherently non-deterministic, producing varied outputs even for identical inputs. While this variability is central to their appeal, it challenges established HCI evaluation practices, which typically assume consistent and predictable system behavior. Designing controlled lab studies under such conditions therefore remains a key methodological challenge. We present a reflective multi-case analysis of four lab-based user studies with GenAI-integrated prototypes, spanning in-car conversational assistants and image generation tools for design workflows. Through cross-case reflection and thematic analysis across all study phases, we identify five methodological challenges and propose eighteen practice-oriented recommendations, organized into five guidelines. These challenges represent methodological constructs that are amplified, redefined, or newly introduced by GenAI's stochastic nature: (C1) reliance on familiar interaction patterns, (C2) fidelity-control trade-offs, (C3) feedback and trust, (C4) gaps in usability evaluation, and (C5) interpretive ambiguity between interface and system issues. Our guidelines address these challenges through strategies such as reframing onboarding to help participants manage unpredictability, extending evaluation with constructs such as trust and intent alignment, and logging system events, including hallucinations and latency, to support transparent analysis. This work contributes (1) a methodological reflection on how GenAI's stochastic nature unsettles lab-based HCI evaluation and (2) eighteen recommendations that help researchers design more transparent, robust, and comparable studies of GenAI systems in controlled settings.
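To illustrate the recommendation to log system events such as hallucinations and latency, the following is a minimal sketch of a per-session event log. It is not the instrumentation used in the four studies: the names `TurnEvent`, `SessionLog`, and `fake_model`, the flag label `"hallucination"`, and the JSON Lines output format are all assumptions made for illustration. Latency is measured around the generation call, and hallucination flags are assumed to be added manually by the researcher during or after the session.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Callable, List


@dataclass
class TurnEvent:
    """One prompt/response exchange in a study session (hypothetical schema)."""
    participant_id: str
    prompt: str
    output: str
    latency_s: float
    # Researcher-assigned labels, e.g. "hallucination"; empty until annotated.
    flags: List[str] = field(default_factory=list)


class SessionLog:
    """Collects events for one participant so system issues can be
    separated from interface issues during analysis."""

    def __init__(self, participant_id: str) -> None:
        self.participant_id = participant_id
        self.events: List[TurnEvent] = []

    def record(self, prompt: str, generate: Callable[[str], str]) -> TurnEvent:
        # Time the generation call with a monotonic clock.
        start = time.perf_counter()
        output = generate(prompt)
        latency = time.perf_counter() - start
        event = TurnEvent(self.participant_id, prompt, output, latency)
        self.events.append(event)
        return event

    def flag(self, index: int, label: str) -> None:
        # Attach a researcher annotation to an already recorded event.
        self.events[index].flags.append(label)

    def to_jsonl(self) -> str:
        # One JSON object per line, convenient for later thematic coding.
        return "\n".join(
            json.dumps(asdict(e), ensure_ascii=False) for e in self.events
        )


def fake_model(prompt: str) -> str:
    """Stand-in for the GenAI system under study (assumption)."""
    return "Turn left in 200 m."


log = SessionLog("P01")
log.record("Navigate to the nearest charging station", fake_model)
log.flag(0, "hallucination")
print(log.to_jsonl())
```

Storing the raw prompt, the raw output, and the measured latency next to the researcher's annotations keeps the record of what the system actually did separate from the interpretation of it, which is the distinction challenge C5 is concerned with.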
