We present an audit mechanism for language models, with a focus on models deployed in healthcare settings. The mechanism takes inspiration from clinical trial design: we frame the language model audit as a single-blind equivalence trial in which subject matter experts serve as the comparator. We show that this framing supports principled sample size and power calculations, so that an audit needs to sample only the minimum number of records while preserving its integrity and statistical soundness. Finally, we provide a real-world example of the audit as used in production in a large-scale public health network.
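To make the sample size idea concrete, the following is a minimal sketch of a per-arm sample size calculation for an equivalence test of two proportions, using the standard textbook normal approximation for two one-sided tests. It is not taken from the paper: the function name `equivalence_sample_size`, the use of accuracy proportions as the endpoint, and all default values (margin, `alpha`, `power`) are illustrative assumptions, and the authors' actual calculation may differ.

```python
import math
from statistics import NormalDist


def equivalence_sample_size(p_expert, p_model, margin, alpha=0.05, power=0.80):
    """Approximate records per arm for an equivalence test of two proportions.

    Hypothetical helper (not from the paper). Uses the normal approximation
    for two one-sided tests (TOST), each run at significance level `alpha`.
    """
    if not (0 < p_expert < 1 and 0 < p_model < 1):
        raise ValueError("proportions must lie strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("equivalence margin must be positive")

    true_diff = abs(p_model - p_expert)
    if true_diff >= margin:
        raise ValueError("assumed true difference must be smaller than the margin")

    beta = 1.0 - power
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    # With no assumed true difference, the type II error is split across
    # both one-sided tests, hence beta / 2; otherwise one test dominates.
    z_beta = NormalDist().inv_cdf(1.0 - beta / 2.0 if true_diff == 0 else 1.0 - beta)

    # Sum of the binomial variances of the two arms.
    variance = p_expert * (1.0 - p_expert) + p_model * (1.0 - p_model)
    n = (z_alpha + z_beta) ** 2 * variance / (margin - true_diff) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Assumed scenario: experts and model both about 90% accurate,
    # equivalence margin of 10 percentage points.
    print(equivalence_sample_size(p_expert=0.90, p_model=0.90, margin=0.10))
```

Under these assumptions, tightening the equivalence margin or raising the target power increases the number of records each arm must review, which is the trade-off a sample size calculation of this kind is meant to make explicit before an audit begins.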