Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before they can inform medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. The framework guarantees that, on average, a prescribed fraction of the selected units indeed meets the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method accurately identifies units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
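To make the selection step concrete, the Python sketch below shows one standard way to instantiate it: split the reference data to train an alignment predictor, compute conformal p-values for the new units against the calibration units whose ground-truth status is unaligned, and apply a Benjamini-Hochberg step, which is equivalent to thresholding the predicted alignment scores at a data-dependent cutoff. All names here (`conformal_alignment_select`, `X_ref`, `A_ref`, `X_test`, `alpha`) and the choice of logistic regression as the predictor are illustrative assumptions, not the paper's exact implementation; this is a minimal sketch of the general recipe described in the abstract, not a definitive reproduction of the authors' procedure.

```python
# Minimal sketch of a conformal-alignment-style selection step.
# Assumed inputs (hypothetical names): feature matrix X_ref with binary
# ground-truth alignment labels A_ref for the reference units, and
# features X_test for new units with model-generated outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def conformal_alignment_select(X_ref, A_ref, X_test, alpha=0.1, seed=0):
    """Return indices of test units selected as aligned, aiming to keep the
    average fraction of unaligned units among the selected below alpha."""
    # Split the reference data: one half trains the alignment predictor,
    # the other half serves as a held-out calibration set.
    X_tr, X_cal, A_tr, A_cal = train_test_split(
        X_ref, A_ref, test_size=0.5, random_state=seed)

    # Train the alignment predictor g-hat; any probabilistic classifier
    # could be substituted here (logistic regression is an assumption).
    g_hat = LogisticRegression(max_iter=1000).fit(X_tr, A_tr)
    s_cal = g_hat.predict_proba(X_cal)[:, 1]    # calibration scores
    s_test = g_hat.predict_proba(X_test)[:, 1]  # test-unit scores

    # Conformal p-value per test unit: the (smoothed) rank of its score
    # among scores of calibration units known to be UNALIGNED (A = 0).
    # A small p-value means the score is unusually high for an unaligned unit.
    s_null = s_cal[np.asarray(A_cal) == 0]
    n0 = len(s_null)
    pvals = (1.0 + (s_null[None, :] >= s_test[:, None]).sum(axis=1)) / (n0 + 1.0)

    # Benjamini-Hochberg: keep the largest k with p_(k) <= alpha * k / m.
    # Selecting these units is the same as thresholding the predicted
    # alignment scores at a data-dependent cutoff.
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(below)[0]) + 1
    return np.sort(order[:k])
```

In the applications the abstract mentions, `X_ref` and `X_test` would collect features extracted from the model's outputs (the abstract notes that the informativeness of various such features is investigated), and `A_ref` would record whether each reference output meets the user-specified alignment criterion, e.g., agreement with human evaluation.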