Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
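The selection mechanism described above can be sketched in a few lines. The simulation below is a minimal illustration, not the paper's exact construction: the predicted alignment scores stand in for a trained alignment predictor, the deterministic conformal p-value is a simplified (non-randomized) variant, and the use of Benjamini-Hochberg as the data-dependent thresholding rule is an assumption about the selection step; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated setup: each unit has a ground-truth binary alignment label and
# a noisy predicted alignment score (a stand-in for a trained predictor).
n_calib, n_test = 500, 200
A_calib = rng.binomial(1, 0.5, n_calib)  # reference (calibration) labels
A_test = rng.binomial(1, 0.5, n_test)    # unknown at selection time
V_calib = A_calib + rng.normal(0.0, 0.7, n_calib)  # predicted scores
V_test = A_test + rng.normal(0.0, 0.7, n_test)

q = 0.2  # target fraction of falsely selected (unaligned) units, on average

# Simplified conformal p-value for each test unit j, testing the null
# "unit j is unaligned": compare its predicted score against calibration
# scores of units whose ground-truth status is unaligned.
null_scores = V_calib[A_calib == 0]
n0 = len(null_scores)
pvals = np.array([(1 + np.sum(null_scores >= v)) / (n0 + 1) for v in V_test])

# Benjamini-Hochberg step-up rule: this induces the data-dependent
# threshold on predicted alignment scores.
m = n_test
order = np.argsort(pvals)
below = np.nonzero(pvals[order] <= q * np.arange(1, m + 1) / m)[0]
selected = order[: below[-1] + 1] if below.size else np.array([], dtype=int)

if selected.size:
    # Realized false discovery proportion among the selected units.
    fdp = np.mean(A_test[selected] == 0)
    print(f"selected {selected.size} units, FDP = {fdp:.3f}")
```

Over repeated draws, the proportion of unaligned units among those selected averages at most `q`, mirroring the on-average guarantee stated in the abstract.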