AI model alignment is crucial due to inadvertent biases in training data and the underspecification of the machine learning pipeline, where models with excellent test metrics may still fail to meet end-user requirements. While post-training alignment via human feedback shows promise, such methods are often limited to generative AI settings, where humans can interpret model outputs and provide feedback on them. In traditional non-generative settings with numerical or categorical outputs, detecting misalignment from single-sample outputs remains challenging, and enforcing alignment during training requires repeating costly training runs. In this paper we consider an alternative strategy. We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions that exhibit specific desired behaviors. We focus on post-processing a pre-trained model $f$ to better align with $\mathcal{P}$ using conformal risk control. Specifically, we develop a general procedure for converting queries that test a given property $\mathcal{P}$ into a collection of loss functions suitable for use in a conformal risk control algorithm. We prove a probabilistic guarantee that the resulting conformal interval around $f$ contains a function approximately satisfying $\mathcal{P}$. We demonstrate our methodology on a collection of supervised learning datasets for (shape-constrained) properties such as monotonicity and concavity. The general procedure is flexible and can be applied to a wide range of desired properties. Finally, we prove that pre-trained models will always require alignment techniques whenever the training data contains even small biases, no matter how large the model or the training set grows.
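The calibration step described above can be sketched concretely for the monotonicity property. The following is a minimal illustrative example, not the paper's exact algorithm: it assumes a one-dimensional input, converts pairwise monotonicity queries on calibration points into a bounded violation loss, and selects the smallest interval half-width $\lambda$ passing a conformal-risk-control-style threshold. The function name `crc_lambda_for_monotonicity` and the specific loss construction are our own illustrative choices.

```python
import numpy as np

def crc_lambda_for_monotonicity(x_cal, f_cal, alpha, lam_grid):
    """Illustrative sketch: choose the smallest half-width lam such that the
    band [f - lam, f + lam] contains a nondecreasing function on the
    calibration points, via a conformal-risk-control-style threshold."""
    order = np.argsort(x_cal)
    f_sorted = f_cal[order]
    n = len(f_cal)
    # Monotonicity "queries": for x_i <= x_j, the band admits a nondecreasing
    # selection iff f(x_i) - lam <= f(x_j) + lam, i.e. f(x_i) - f(x_j) <= 2*lam.
    iu = np.triu_indices(n, k=1)
    gaps = (f_sorted[:, None] - f_sorted[None, :])[iu]  # f_i - f_j for i < j
    for lam in np.sort(lam_grid):
        risk = np.mean(gaps > 2 * lam)  # fraction of violated pairs, in [0, 1]
        # CRC-style inflated empirical risk, with losses bounded by B = 1:
        # accept lam once (n/(n+1)) * risk + 1/(n+1) <= alpha.
        if (n * risk + 1) / (n + 1) <= alpha:
            return lam
    return lam_grid[-1]  # no grid value meets the target level
```

Under conformal risk control, a threshold of this form yields an expected-loss guarantee $\mathbb{E}[\ell(\hat\lambda)] \le \alpha$ for exchangeable calibration data with a loss that is monotone in $\lambda$; note that this sketch glosses over the fact that pairwise losses are not themselves exchangeable across calibration points, a detail the general procedure in the paper handles.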