Evaluations of generative AI models often collapse nuanced behaviour into a single number computed for a single decoding configuration. Such point estimates obscure tail risks, demographic disparities, and the existence of multiple near-optimal operating points. We propose a unified framework that embraces multiplicity by modelling the distribution of harmful behaviour across the entire space of decoding knobs and prompts, quantifying risk through tail-focused metrics, and integrating stakeholder preferences. Our technical contributions are threefold: (i) we formalise decoding Rashomon sets, i.e. regions of knob space whose risk is near-optimal under given criteria, and measure their size and disagreement; (ii) we develop a dependent Dirichlet process (DDP) mixture with stakeholder-conditioned stick-breaking weights to learn multi-modal harm surfaces; and (iii) we introduce an active sampling pipeline that uses Bayesian deep learning surrogates to explore knob space efficiently. Our approach bridges multiplicity theory, Bayesian nonparametrics, and stakeholder-aligned sensitivity analysis, paving the way for trustworthy deployment of generative models.
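To make contribution (ii) concrete, the stakeholder-conditioned stick-breaking construction can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the logistic link from stakeholder covariates to stick fractions, the truncation level `K`, and all parameter names (`W`, `b`, `x`) are assumptions chosen for illustration.

```python
import math
import random

def stick_breaking_weights(v):
    """Turn stick fractions v_1..v_K into mixture weights
    pi_k = v_k * prod_{j<k} (1 - v_j)."""
    weights, remaining = [], 1.0
    for vk in v:
        weights.append(vk * remaining)
        remaining *= (1.0 - vk)
    return weights

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stakeholder_conditioned_weights(x, W, b):
    # Hypothetical link: each stick fraction is a logistic function of
    # stakeholder covariates x, so the mixture weights (and hence the
    # learned harm surface) shift smoothly with the stakeholder profile.
    v = [sigmoid(sum(wi * xi for wi, xi in zip(w_row, x)) + bk)
         for w_row, bk in zip(W, b)]
    v[-1] = 1.0  # truncate the process so the K weights sum to one
    return stick_breaking_weights(v)

# Toy setup: K mixture components, d stakeholder covariates.
random.seed(0)
K, d = 10, 3
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
b = [random.gauss(0, 1) for _ in range(K)]
x = [0.2, -1.0, 0.5]  # toy stakeholder covariate vector
pi = stakeholder_conditioned_weights(x, W, b)
```

Different covariate vectors `x` yield different weight vectors `pi`, which is the mechanism by which a single DDP mixture can express distinct, stakeholder-dependent views of the same multi-modal harm surface.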