We examine the reliability of a widely used clinical AI benchmark whose reference labels were partially generated by LLMs, and find that a substantial fraction are clinically misaligned. We introduce a phased stewardship procedure that amplifies the impact of physician experts' feedback, and then demonstrate, via a controlled RL experiment, how uncaught label bias can materially affect downstream LLM evaluation and alignment. Our results show that partially LLM-generated labels can embed systematic errors that distort not only evaluation but also downstream model alignment. By adopting a hybrid oversight system, scarce expert feedback can be prioritized to maintain benchmarks as living, clinically grounded documents. Ensuring this alignment is a prerequisite for the safe deployment of LLMs in high-stakes medical decision support.