This study presents NewsBench, a novel benchmark framework developed to evaluate the capability of Large Language Models (LLMs) in Chinese Journalistic Writing Proficiency (JWP) and their Safety Adherence (SA), addressing the gap between journalistic ethics and the risks associated with AI utilization. Comprising 1,267 tasks across 5 editorial applications and 7 aspects (including safety and journalistic writing with 4 detailed facets), and spanning 24 news topic domains, NewsBench employs two GPT-4 based automatic evaluation protocols validated by human assessment. Our comprehensive analysis of 10 LLMs identified GPT-4 and ERNIE Bot as the top performers, yet revealed a relative deficiency in adherence to journalistic ethics during creative writing tasks. These findings underscore the need for enhanced ethical guidance in AI-generated journalistic content, marking a step forward in aligning AI capabilities with journalistic standards and safety considerations.