We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While this measure provides a new dimension for prompt-adaptive safety evaluation, quantifying it is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame the estimation problem as one of survival analysis. Building on recent developments in conformal prediction, we propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.
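To make the censored-observation setup concrete, below is a minimal Python sketch of how one might record the time-to-unsafe-sampling of a single prompt under a fixed sampling budget. The names `generate`, `is_unsafe`, and `budget` are illustrative placeholders rather than anything defined in the paper; the routine simply returns the index of the first unsafe generation, or a right-censored observation when the budget is exhausted.

```python
from typing import Callable, Tuple


def observe_time_to_unsafe(
    prompt: str,
    generate: Callable[[str], str],    # hypothetical: draws one LLM response for the prompt
    is_unsafe: Callable[[str], bool],  # hypothetical: safety classifier (e.g., a toxicity check)
    budget: int,                       # maximum number of generations we can afford for this prompt
) -> Tuple[int, bool]:
    """Return (T, event): T is the number of generations drawn, and event=True
    iff an unsafe response was observed within the budget. event=False means
    the time-to-unsafe-sampling is right-censored at `budget`, which is what
    motivates the survival-analysis framing described in the abstract."""
    for t in range(1, budget + 1):
        if is_unsafe(generate(prompt)):
            return t, True   # unsafe response observed on the t-th draw
    return budget, False     # no unsafe response within the budget (censored observation)
```

A calibration set of such (T, event) pairs is the kind of censored data a conformalized survival procedure would consume to produce a prompt-level LPB; how the total sampling budget is allocated across calibration prompts is the quantity the paper's allocation scheme optimizes.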