We present NewsBench, a novel evaluation framework for systematically assessing the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. The benchmark dataset covers four facets of writing proficiency and six facets of safety adherence, and comprises 1,267 carefully and manually designed test samples, in the form of multiple-choice and short-answer questions, for five editorial tasks across 24 news domains. To measure performance, we propose separate GPT-4-based automatic evaluation protocols that assess LLM generations for short-answer questions in terms of writing proficiency and safety adherence; both are validated by their high correlation with human evaluations. Using this framework, we conduct a comprehensive analysis of ten popular LLMs that can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.