AI model alignment is crucial due to inadvertent biases in training data and the underspecification of the machine learning pipeline, where models with excellent test metrics may still fail to meet end-user requirements. While post-training alignment via human feedback shows promise, such methods are often limited to generative AI settings, where humans can interpret model outputs and provide feedback on them. In traditional non-generative settings with numerical or categorical outputs, detecting misalignment from single-sample outputs remains challenging, and enforcing alignment during training requires repeating costly training runs. In this paper we consider an alternative strategy. We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions that exhibit specific desired behaviors. We focus on post-processing a pre-trained model $f$ to better align with $\mathcal{P}$ using conformal risk control. Specifically, we develop a general procedure for converting queries that test a given property $\mathcal{P}$ into a collection of loss functions suitable for use in a conformal risk control algorithm. We prove a probabilistic guarantee that the resulting conformal interval around $f$ contains a function approximately satisfying $\mathcal{P}$. We demonstrate our methodology on a collection of supervised learning datasets for (shape-constrained) properties such as monotonicity and concavity. The general procedure is flexible and can be applied to a wide range of desired properties. Finally, we prove that pre-trained models will always require alignment techniques whenever the training data contains even small biases, no matter how large the model or the training set grows.
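The calibration step described above can be sketched concretely for the monotonicity property. The following is a minimal illustrative example, not the paper's exact algorithm: it assumes a one-dimensional input, converts pairwise monotonicity queries on calibration points into a bounded violation loss, and selects the smallest interval half-width $\lambda$ passing a conformal-risk-control-style threshold. The function name `crc_lambda_for_monotonicity` and the specific loss construction are our own illustrative choices.

```python
import numpy as np

def crc_lambda_for_monotonicity(x_cal, f_cal, alpha, lam_grid):
    """Illustrative sketch: choose the smallest half-width lam such that the
    band [f - lam, f + lam] contains a nondecreasing function on the
    calibration points, via a conformal-risk-control-style threshold."""
    order = np.argsort(x_cal)
    f_sorted = f_cal[order]
    n = len(f_cal)
    # Monotonicity "queries": for x_i <= x_j, the band admits a nondecreasing
    # selection iff f(x_i) - lam <= f(x_j) + lam, i.e. f(x_i) - f(x_j) <= 2*lam.
    iu = np.triu_indices(n, k=1)
    gaps = (f_sorted[:, None] - f_sorted[None, :])[iu]  # f_i - f_j for i < j
    for lam in np.sort(lam_grid):
        risk = np.mean(gaps > 2 * lam)  # fraction of violated pairs, in [0, 1]
        # CRC-style inflated empirical risk, with losses bounded by B = 1:
        # accept lam once (n/(n+1)) * risk + 1/(n+1) <= alpha.
        if (n * risk + 1) / (n + 1) <= alpha:
            return lam
    return lam_grid[-1]  # no grid value meets the target level
```

Under conformal risk control, a threshold of this form yields an expected-loss guarantee $\mathbb{E}[\ell(\hat\lambda)] \le \alpha$ for exchangeable calibration data with a loss that is monotone in $\lambda$; note that this sketch glosses over the fact that pairwise losses are not themselves exchangeable across calibration points, a detail the general procedure in the paper handles.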