Human annotations play a crucial role in evaluating the performance of GenAI models. Two common challenges in practice, however, are missing annotations (the response variable of interest) and cluster dependence among human-AI interactions (e.g., questions asked by the same user may be highly correlated). Reliable inference must address both issues to estimate average scores from human annotations without bias and to quantify the uncertainty of those estimates appropriately. In this paper, we analyze the doubly robust estimator, a widely used method in missing data analysis and causal inference, in this setting and establish novel theoretical properties under cluster dependence. We further illustrate our findings through simulations and a real-world conversation quality dataset. Our theoretical and empirical results underscore the importance of accounting for cluster dependence in missing response problems to perform valid statistical inference.
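To make the setting concrete, the following is a minimal sketch of a doubly robust (augmented inverse probability weighted) estimate of a mean score with missing annotations, together with a cluster-robust standard error. It uses the standard textbook form of the estimator and of the cluster-level variance, which may differ from the exact estimator and variance analyzed in the paper. The function name `dr_mean_cluster`, its arguments, and the simulated data are hypothetical and chosen only for illustration; the nuisance predictions `m_hat` (outcome model) and `pi_hat` (probability that an annotation is observed) are assumed to have been fitted elsewhere.

```python
import numpy as np


def dr_mean_cluster(y, r, m_hat, pi_hat, cluster):
    """Doubly robust (AIPW) estimate of E[Y] with cluster-robust and naive SEs.

    y       : responses; entries where r == 0 are ignored (may be NaN)
    r       : 1 if the annotation is observed, 0 if it is missing
    m_hat   : predicted E[Y | X] for every row (outcome model, assumed given)
    pi_hat  : predicted P(R = 1 | X) for every row (propensity model, assumed given)
    cluster : cluster label for every row (e.g. a user id)
    """
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=float)
    m_hat = np.asarray(m_hat, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)

    # Zero out unobserved responses so NaN placeholders never propagate;
    # those rows are multiplied by r = 0 in the correction term anyway.
    y_filled = np.where(r == 1, y, 0.0)

    # Per-row doubly robust score: model prediction plus a weighted residual.
    phi = m_hat + r * (y_filled - m_hat) / pi_hat
    n = phi.size
    mu = phi.mean()

    # Cluster-robust variance: sum centered scores within each cluster first,
    # so that within-cluster correlation contributes to the variance.
    _, idx = np.unique(np.asarray(cluster), return_inverse=True)
    cluster_sums = np.bincount(idx, weights=phi - mu)
    se_cluster = np.sqrt(np.sum(cluster_sums ** 2)) / n

    # Naive variance that treats every row as independent, for comparison.
    se_iid = np.sqrt(np.sum((phi - mu) ** 2)) / n
    return mu, se_cluster, se_iid


# Hypothetical simulation: 200 users, 5 conversations each, with a shared
# per-user effect that makes scores from the same user correlated.
rng = np.random.default_rng(0)
n_users, per_user = 200, 5
user = np.repeat(np.arange(n_users), per_user)
x = rng.normal(size=user.size)
user_effect = rng.normal(size=n_users)[user]
y_full = 1.0 + x + user_effect + rng.normal(size=user.size)

# Annotations are missing at random given x; the true nuisance functions
# are plugged in here purely to keep the sketch short.
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + x)))
r_obs = rng.binomial(1, pi_true)
y_obs = np.where(r_obs == 1, y_full, np.nan)

est, se_cl, se_naive = dr_mean_cluster(y_obs, r_obs, 1.0 + x, pi_true, user)
print(f"estimate={est:.3f}  cluster SE={se_cl:.3f}  naive SE={se_naive:.3f}")
```

When scores from the same user are positively correlated, as in this simulation, the naive standard error would typically understate the uncertainty relative to the cluster-robust one. That is the practical concern the abstract raises.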