AI model alignment is crucial due to inadvertent biases in training data and the underspecified nature of modern machine learning pipelines, which can produce many models with excellent test-set metrics that nonetheless fail to meet end-user requirements. Recent advances demonstrate that post-training model alignment via human feedback can address some of these challenges; however, these methods are often confined to settings (such as generative AI) where humans can interpret model outputs and provide feedback. In traditional non-generative settings, where model outputs are numerical values or classes, detecting misalignment from single-sample outputs is highly challenging. In this paper we consider an alternative strategy. We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions that exhibit specific desired behaviors. We focus on post-processing a pre-trained model $f$ to better align with $\mathcal{P}$ using conformal risk control. Specifically, we develop a general procedure for converting queries for a given property $\mathcal{P}$ into a collection of loss functions suitable for use in a conformal risk control algorithm, and we prove a probabilistic guarantee that the resulting conformal interval around $f$ contains a function approximately satisfying $\mathcal{P}$. Given the capabilities of modern AI models with extensive parameters and training data, one might assume alignment issues will resolve naturally; however, increasing the training data or the number of parameters in a random feature model does not eliminate the need for alignment techniques when the pre-training data is biased. We demonstrate our alignment methodology on supervised learning datasets for properties such as monotonicity and concavity. Our procedure is flexible and can be applied to a wide range of desired properties.
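To make the conformal risk control step concrete, the following is a minimal sketch in the standard conformal-risk-control form; the symmetric band around $f$, the pairwise monotonicity loss, and the threshold rule below are illustrative assumptions rather than the paper's exact construction. For calibration query pairs $(x_{i,1}, x_{i,2})$ with $x_{i,1} < x_{i,2}$, take
\[
C_\lambda(x) = \big[f(x) - \lambda,\; f(x) + \lambda\big], \qquad
\ell_\lambda(x_{i,1}, x_{i,2}) = \mathbf{1}\big\{\, f(x_{i,1}) - f(x_{i,2}) > 2\lambda \,\big\},
\]
so that $\ell_\lambda = 1$ exactly when no monotone (non-decreasing) function can pass through the band on that pair. Conformal risk control then selects the smallest band achieving a target risk level $\alpha$,
\[
\hat{\lambda} = \inf\Big\{ \lambda : \tfrac{n}{n+1}\,\hat{R}_n(\lambda) + \tfrac{1}{n+1} \le \alpha \Big\},
\qquad \hat{R}_n(\lambda) = \tfrac{1}{n}\sum_{i=1}^{n} \ell_\lambda(x_{i,1}, x_{i,2}),
\]
which, for exchangeable calibration and test pairs, guarantees $\mathbb{E}\big[\ell_{\hat{\lambda}}\big] \le \alpha$: the band $C_{\hat{\lambda}}$ fails to accommodate a monotone function on a fresh queried pair with probability at most $\alpha$, matching the flavor of the guarantee stated above.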