Benchmarks are necessary for healthcare evaluation, but they are not sufficient for predicting deployment performance. Our position is that the evaluation--deployment gap arises not from poorly designed benchmarks, but from implicit assumptions about how users interact with models, assumptions that benchmarks alone cannot surface. To make this precise, we classify these assumptions into two categories: task assumptions, which can be tested from conversation data alone, and outcome assumptions, which require outcome data and behavioral studies to test. Critically, outcome assumptions depend on human behavior, which even well-designed benchmarks cannot directly observe. To show that this framework can be applied in practice, we retrospectively analyze a healthcare RCT as a case study and find that the evaluation--deployment gap separates naturally into a task gap and an outcome gap of roughly equal size. To help close this gap, we make two contributions: first, BenchmarkCards, an artifact that documents these assumptions; second, staged evaluation, a procedure that systematically tests the assumptions and evaluates performance.