Before outputs from foundation models are deployed in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before they are used in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. The framework guarantees that, on average, at least a prescribed fraction of the selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects the new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method accurately identifies units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
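The abstract does not spell out how the data-dependent threshold is computed. One plausible instantiation, assumed here rather than taken from the text, is to turn predicted alignment scores into conformal p-values using held-out reference units and then apply the Benjamini-Hochberg procedure at a target level `alpha`. The sketch below illustrates that idea only. The function names, the binary `calib_aligned` labels, and the choice of Benjamini-Hochberg are all hypothetical, and the predicted scores are assumed to come from an alignment predictor trained on a separate split of the reference data.

```python
import numpy as np


def conformal_pvalues(calib_scores, calib_aligned, test_scores):
    """Conformal p-values for the null 'this test unit is NOT aligned'.

    Assumption: larger predicted scores indicate more likely alignment, and
    calib_aligned holds ground-truth alignment status for the held-out
    reference (calibration) units.
    """
    calib_scores = np.asarray(calib_scores, dtype=float)
    calib_aligned = np.asarray(calib_aligned, dtype=bool)
    test_scores = np.asarray(test_scores, dtype=float)

    # Only unaligned calibration units count as evidence against a test unit.
    null_scores = np.sort(calib_scores[~calib_aligned])
    # For each test score, count unaligned calibration scores that are >= it.
    n_geq = len(null_scores) - np.searchsorted(null_scores, test_scores, side="left")
    return (1.0 + n_geq) / (len(calib_scores) + 1.0)


def benjamini_hochberg(pvalues, alpha):
    """Boolean mask of units selected by the Benjamini-Hochberg procedure."""
    pvalues = np.asarray(pvalues, dtype=float)
    m = len(pvalues)
    if m == 0:
        return np.zeros(0, dtype=bool)
    order = np.argsort(pvalues)
    # The k-th smallest p-value is compared against alpha * k / m.
    passed = pvalues[order] <= alpha * np.arange(1, m + 1) / m
    if not passed.any():
        return np.zeros(m, dtype=bool)
    k = np.nonzero(passed)[0].max() + 1
    # Selecting p <= alpha * k / m is equivalent to a data-dependent
    # threshold on the predicted alignment scores.
    return pvalues <= alpha * k / m


def select_aligned(calib_scores, calib_aligned, test_scores, alpha=0.1):
    """Select test units whose outputs are certified as aligned (sketch)."""
    pvals = conformal_pvalues(calib_scores, calib_aligned, test_scores)
    return benjamini_hochberg(pvals, alpha)


# Toy example with made-up numbers: nine reference units, three new units.
calib_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
calib_aligned = [0, 0, 0, 0, 1, 1, 1, 1, 1]
test_scores = [0.5, 0.05, 0.99]
selected = select_aligned(calib_scores, calib_aligned, test_scores, alpha=0.2)
```

In this sketch `alpha` plays the role of the tolerated fraction of misaligned units among those selected, so the prescribed fraction of aligned selections corresponds to `1 - alpha`. A guarantee of this kind typically relies on the reference units and the new units being exchangeable, which is assumed here and not checked by the code.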