Building pluralistic AI requires designing models that can be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline behavior. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
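To make the notion of a steerability index concrete, the following is a minimal hypothetical sketch, not the paper's exact formulation: it assumes model behavior along a persona dimension is summarized as a discrete distribution over behavioral bins, and measures steerability as the largest total-variation shift from baseline achieved across increasing levels of steering effort.

```python
# Hypothetical sketch of a prompt-steerability index (illustrative only;
# the benchmark's formal definition may differ).

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def steerability_index(baseline, steered_by_effort):
    """Largest shift from baseline achieved over increasing steering efforts.

    baseline: distribution over behavioral bins (floats summing to 1)
    steered_by_effort: one steered distribution per steering-effort level
    """
    return max(total_variation(baseline, q) for q in steered_by_effort)

# Toy data: a baseline skewed toward one end of a persona dimension.
baseline = [0.7, 0.2, 0.1]
steer_pos = [[0.6, 0.25, 0.15], [0.4, 0.3, 0.3], [0.2, 0.3, 0.5]]  # push up
steer_neg = [[0.8, 0.15, 0.05], [0.85, 0.1, 0.05]]                 # push down

# Asymmetry: the model steers further in one direction than the other.
print(steerability_index(baseline, steer_pos))
print(steerability_index(baseline, steer_neg))
```

Computing the index in both directions, as above, surfaces the kind of asymmetry the abstract describes: a skewed baseline leaves less room to shift the distribution further toward the end it already favors.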