Generative AI is rapidly moving from research to deployment, elevating the need for responsible development, evaluation, and governance. We conduct a PRISMA-guided review of 232 studies (November 2022 to December 2025) spanning large language models, vision-language models, diffusion models, and agentic pipelines. We make four contributions: (1) the first survey bridging governance principles, technical evaluation, and domain deployment across all four system types; (2) a ten-criterion rubric (C1-C10) scoring major AI safety benchmarks on risk-surface coverage, paired with a policy crosswalk mapping benchmarks to regulatory requirements; (3) twelve lifecycle KPIs, explainability guidance for foundation models, and a testbed catalogue; and (4) domain-specific analysis across healthcare, finance, education, arts, agriculture, and defense. Three findings emerge: benchmark coverage is dense for bias and toxicity but sparse for privacy, provenance, deepfakes, and system-level failures in agentic settings; evaluations remain largely static and task-local, limiting audit portability; and inconsistent documentation complicates cross-release comparison. We outline a research agenda prioritizing adaptive multimodal evaluation, privacy and provenance testing, deepfake risk assessment, calibration reporting, versioned artifacts, and continuous monitoring. This survey offers a structured path to aligning generative AI evaluation with governance needs for safe and accountable deployment.