The widespread adoption of generative artificial intelligence has heightened concerns about the potential harms of AI-generated text, which stem primarily from non-factual, unfair, and toxic content. Previous researchers have invested considerable effort in assessing the harmlessness of generative language models. However, existing benchmarks struggle in the era of large language models (LLMs), owing to these models' stronger language generation and instruction-following capabilities and their wider range of applications. In this paper, we propose FFT, a new benchmark with 2116 elaborately designed instances, for evaluating LLM harmlessness in terms of factuality, fairness, and toxicity. To investigate the potential harms of LLMs, we evaluate 9 representative LLMs covering various parameter scales, training stages, and creators. Experiments show that the harmlessness of LLMs remains unsatisfactory, and extensive analysis yields insightful findings that could inspire future research on harmless LLMs.