We present NewsBench, a novel evaluation framework for systematically assessing the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. Our benchmark dataset focuses on four facets of writing proficiency and six facets of safety adherence, and comprises 1,267 manually and carefully designed test samples, in the form of multiple-choice and short-answer questions, covering five editorial tasks across 24 news domains. To measure performance, we propose GPT-4-based automatic evaluation protocols that assess LLM generations on short-answer questions in terms of writing proficiency and safety adherence; both protocols are validated by their high correlations with human evaluations. Using this systematic evaluation framework, we conduct a comprehensive analysis of ten popular LLMs that can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as the top performers, yet reveal a relative deficiency in journalistic safety adherence on creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.