We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While this measure provides a new dimension for prompt-adaptive safety evaluation, quantifying it is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame the estimation problem as one of survival analysis. Building on recent developments in conformal prediction, we propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
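To make the notion of a conformal lower predictive bound concrete, the following is a minimal sketch of a marginal (covariate-free) split-conformal LPB, not the paper's prompt-adaptive method with optimized budget allocation. It assumes a calibration set of fully observed times-to-unsafe-sampling (no censoring), and relies only on exchangeability: the rank of a test time among the calibration times is uniform, so the chosen order statistic is undershot with probability at most alpha.

```python
import numpy as np

def conformal_lpb(cal_times, alpha=0.1):
    """Marginal split-conformal lower predictive bound (toy sketch).

    cal_times : fully observed times-to-unsafe-sampling on a calibration
        set, one per prompt (censoring is ignored in this sketch).
    alpha : miscoverage level.

    Returns L such that, for an exchangeable test prompt,
    P(T_test >= L) >= 1 - alpha. With integer (tied) times the strict
    inequality only makes the bound more conservative.
    """
    t = np.sort(np.asarray(cal_times))
    n = len(t)
    # Conservative rank: k/(n+1) <= alpha guarantees coverage.
    k = int(np.floor(alpha * (n + 1)))
    if k < 1:
        return 0  # too little calibration data for this alpha
    return t[k - 1]

# Example: 99 calibration prompts with times 1..99 and alpha = 0.1
# give k = floor(0.1 * 100) = 10, so the LPB is the 10th smallest time.
print(conformal_lpb(list(range(1, 100)), alpha=0.1))
```

The paper's setting differs in two essential ways this sketch ignores: unsafe events may never be observed within the sampling budget (right censoring), and the bound should adapt to the prompt rather than be a single marginal quantile.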