Generative AI is rapidly moving from research to deployment, elevating the need for responsible development, evaluation, and governance. We conduct a PRISMA-guided review of 232 studies (November 2022 to December 2025) spanning large language models, vision-language models, diffusion models, and agentic pipelines. We make four contributions: (1) the first survey bridging governance principles, technical evaluation, and domain deployment across all four system types; (2) a ten-criterion rubric (C1-C10) scoring major AI safety benchmarks on risk-surface coverage, paired with a policy crosswalk mapping benchmarks to regulatory requirements; (3) twelve lifecycle KPIs, explainability guidance for foundation models, and a testbed catalogue; and (4) domain-specific analysis across healthcare, finance, education, arts, agriculture, and defense. Three findings emerge: benchmark coverage is dense for bias and toxicity but sparse for privacy, provenance, deepfakes, and system-level failures in agentic settings; evaluations remain largely static and task-local, limiting audit portability; and inconsistent documentation complicates cross-release comparison. We outline a research agenda prioritizing adaptive multimodal evaluation, privacy and provenance testing, deepfake risk assessment, calibration reporting, versioned artifacts, and continuous monitoring. This survey offers a structured path to aligning generative AI evaluation with governance needs for safe and accountable deployment.