AI safety is a rapidly growing area of research that seeks to prevent harm from, and misuse of, frontier AI technology, particularly generative AI (GenAI) tools that can create realistic, high-quality content from text prompts. Examples of such tools include large language models (LLMs) and text-to-image (T2I) diffusion models. As the performance of leading GenAI models approaches saturation due to similar training data sources and neural network architecture designs, the development of reliable safety guardrails has become a key differentiator for responsible and sustainable AI. This paper formalizes the concept of computational safety, a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI through the lens of signal processing theory and methods. In particular, we explore two exemplary categories of computational safety challenges in GenAI that can be formulated as hypothesis testing problems. For the safety of model input, we show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts that attempt jailbreaks. For the safety of model output, we elucidate how statistical signal processing can be used to detect AI-generated content. Finally, we discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
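To make the hypothesis testing formulation concrete, the following is a minimal sketch of a threshold detector, not the method developed in this paper. It assumes a scalar detection score is already available for each input or output, decides between H0 (safe, or human-written) and H1 (unsafe, or AI-generated) by comparing that score to a threshold, and calibrates the threshold on H0 samples to meet a target false positive rate. The Gaussian scores, the function names (`calibrate_threshold`, `detect`, `rates`), and the 5% target are all illustrative assumptions.

```python
import random


def calibrate_threshold(null_scores, target_fpr):
    """Pick tau so that at most about target_fpr of the H0 scores exceed it."""
    ordered = sorted(null_scores)
    # Index of the (1 - target_fpr) empirical quantile, clamped to a valid index.
    k = min(len(ordered) - 1, int((1.0 - target_fpr) * len(ordered)))
    return ordered[k]


def detect(score, tau):
    """Decide H1 (flag the sample) when the score strictly exceeds tau."""
    return score > tau


def rates(null_scores, alt_scores, tau):
    """Empirical false positive rate and true positive rate at threshold tau."""
    fpr = sum(detect(s, tau) for s in null_scores) / len(null_scores)
    tpr = sum(detect(s, tau) for s in alt_scores) / len(alt_scores)
    return fpr, tpr


# Synthetic scores (assumption): H0 ~ N(0, 1), H1 ~ N(2, 1).
rng = random.Random(0)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(2000)]
alt_scores = [rng.gauss(2.0, 1.0) for _ in range(2000)]

tau = calibrate_threshold(null_scores, target_fpr=0.05)
fpr, tpr = rates(null_scores, alt_scores, tau)
print(f"tau={tau:.3f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

In practice the score would come from a safety-specific statistic, such as a sensitivity or loss-landscape measure for prompts or a statistical feature of generated content, and the threshold would be calibrated on held-out benign data rather than on synthetic samples.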