Beliefs and values are increasingly being incorporated into our AI systems through alignment processes, such as carefully curating data collection principles or regularizing the loss function used for training. However, the meta-alignment problem is that these human beliefs are diverse and not aligned across populations; furthermore, the implicit strength of each belief may not be well calibrated even among humans, especially when trying to generalize across contexts. Specifically, in high-regret situations, we observe that contextual counterfactuals and recourse costs are particularly important in updating a decision maker's beliefs and the strength with which those beliefs are held. Therefore, we argue that including counterfactuals is key to accurately calibrating beliefs during alignment. To do this, we first segment belief diversity into two categories: subjectivity (across individuals within a population) and epistemic uncertainty (within an individual across different contexts). Leveraging our notion of epistemic uncertainty, we introduce the "belief calibration cycle" framework, which uses multi-objective optimization to calibrate this diversity of beliefs more holistically with context-driven counterfactual reasoning. We apply our framework empirically to find a Pareto frontier of clustered optimal belief strengths that generalize across different contexts, demonstrating its efficacy on a toy dataset for credit decisions.
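To make the notion of a Pareto frontier over belief strengths concrete, the following minimal sketch filters a set of candidate cost vectors down to the non-dominated ones. It is an illustration under stated assumptions, not the framework described above: the function names `dominates` and `pareto_front`, the two objectives (a regret value in each of two contexts, both minimized), and the numbers in `candidate_costs` are all hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if cost vector a Pareto-dominates b (all objectives minimized).

    a dominates b when a is no worse on every objective and strictly
    better on at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(costs: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of the cost vectors that no other vector dominates."""
    return [
        i
        for i, ci in enumerate(costs)
        if not any(dominates(cj, ci) for j, cj in enumerate(costs) if j != i)
    ]


# Hypothetical data: each row is one candidate setting of belief strengths,
# scored as (regret in context A, regret in context B). Lower is better.
candidate_costs = [
    (0.10, 0.90),
    (0.30, 0.40),
    (0.35, 0.45),  # dominated by (0.30, 0.40)
    (0.80, 0.15),
    (0.90, 0.95),  # dominated by (0.30, 0.40)
]

front = pareto_front(candidate_costs)
print(front)  # [0, 1, 3]
```

The pairwise filter is quadratic in the number of candidates, which is adequate for a toy dataset; how the candidates are generated, clustered, and scored is left open here.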