AI model alignment is crucial due to inadvertent biases in training data and the underspecified nature of modern machine learning pipelines, which can produce many models with excellent test-set metrics that nonetheless fail to meet end-user requirements. Recent advances demonstrate that post-training model alignment via human feedback can address some of these challenges; however, these methods are often confined to settings (such as generative AI) where humans can interpret model outputs and provide feedback. In traditional non-generative settings, where model outputs are numerical values or classes, detecting misalignment from single-sample outputs is highly challenging. In this paper we consider an alternative strategy. We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions that exhibit specific desired behaviors. We focus on post-processing a pre-trained model $f$ to better align with $\mathcal{P}$ using conformal risk control. Specifically, we develop a general procedure for converting queries for a given property $\mathcal{P}$ into a collection of loss functions suitable for use in a conformal risk control algorithm, and we prove a probabilistic guarantee that the resulting conformal interval around $f$ contains a function approximately satisfying $\mathcal{P}$. Given the capabilities of modern AI models with extensive parameters and training data, one might assume alignment issues will resolve naturally; however, increasing the training data or the number of parameters in a random feature model does not eliminate the need for alignment techniques when the pre-training data is biased. We demonstrate our alignment methodology on supervised learning datasets for properties such as monotonicity and concavity. Our procedure is flexible and can be applied to a wide range of desired properties.
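To make the conformal risk control step concrete, the following is a minimal sketch in the standard conformal-risk-control form; the symmetric band around $f$, the pairwise monotonicity loss, and the threshold rule below are illustrative assumptions rather than the paper's exact construction. For calibration query pairs $(x_{i,1}, x_{i,2})$ with $x_{i,1} < x_{i,2}$, take
\[
C_\lambda(x) = \big[f(x) - \lambda,\; f(x) + \lambda\big], \qquad
\ell_\lambda(x_{i,1}, x_{i,2}) = \mathbf{1}\big\{\, f(x_{i,1}) - f(x_{i,2}) > 2\lambda \,\big\},
\]
so that $\ell_\lambda = 1$ exactly when no monotone (non-decreasing) function can pass through the band on that pair. Conformal risk control then selects the smallest band achieving a target risk level $\alpha$,
\[
\hat{\lambda} = \inf\Big\{ \lambda : \tfrac{n}{n+1}\,\hat{R}_n(\lambda) + \tfrac{1}{n+1} \le \alpha \Big\},
\qquad \hat{R}_n(\lambda) = \tfrac{1}{n}\sum_{i=1}^{n} \ell_\lambda(x_{i,1}, x_{i,2}),
\]
which, for exchangeable calibration and test pairs, guarantees $\mathbb{E}\big[\ell_{\hat{\lambda}}\big] \le \alpha$: the band $C_{\hat{\lambda}}$ fails to accommodate a monotone function on a fresh queried pair with probability at most $\alpha$, matching the flavor of the guarantee stated above.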