AI and machine learning (ML) models are commonly evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications, a construct validity issue. To improve the validity and practical usefulness of evaluations, we propose using an estimands framework adapted from international clinical trials guidelines. This framework provides a systematic structure for inference and reporting in evaluations, emphasizing the importance of a well-defined estimation target. We illustrate our proposal with examples of commonly used evaluation methodologies (involving cross-validation, clustering evaluation, and LLM benchmarking) that can lead to incorrect rankings of competing models (rank reversals) with high probability, even when performance differences are large. We demonstrate how the estimands framework can help uncover underlying issues, their causes, and potential solutions. Ultimately, we believe this framework can improve the validity of evaluations through better-aligned inference, and help decision-makers and model users interpret reported results more effectively.
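To make the rank-reversal phenomenon concrete, the following is a minimal Monte Carlo sketch, not the paper's experimental setup: it assumes two hypothetical models whose true accuracies (the estimands) differ by two percentage points, scores each on a small random evaluation sample, and estimates how often sampling noise alone flips the observed ranking. All values (`true_acc_a`, `true_acc_b`, `n_eval`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed estimands: model A truly outperforms model B on the target
# population, but each model is scored on a finite evaluation sample,
# so the observed benchmark scores are noisy estimates.
true_acc_a, true_acc_b = 0.82, 0.80  # hypothetical true accuracies
n_eval = 200                         # hypothetical evaluation set size
n_trials = 10_000                    # Monte Carlo repetitions

reversals = 0
for _ in range(n_trials):
    # Observed accuracy = fraction correct on a fresh random sample,
    # modeled as a binomial draw around the true accuracy.
    obs_a = rng.binomial(n_eval, true_acc_a) / n_eval
    obs_b = rng.binomial(n_eval, true_acc_b) / n_eval
    reversals += obs_b > obs_a       # ranking flips despite the true gap

print(f"estimated rank-reversal probability: {reversals / n_trials:.2%}")
```

Under these assumptions the reversal probability is roughly 30%, illustrating why a reported ranking on one benchmark sample can disagree with the ranking on the well-defined estimation target that the estimands framework asks evaluators to specify.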