This paper presents CG-Eval, the first comprehensive evaluation of the generation capabilities of Chinese large language models across a wide range of academic disciplines. Models are assessed on their ability to generate accurate and relevant responses to different types of questions in six disciplines: Science and Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical Practitioner Qualification Examination, Judicial Examination, and Certified Public Accountant Examination. This paper also presents Gscore, a composite index computed as a weighted sum of multiple metrics, which measures the quality of a model's generated text against a reference. The test data and test results can be found at http://cgeval.besteasy.com/.
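To make the idea of a composite index concrete, the following is a minimal sketch of a weighted sum over several per-response metrics. The abstract does not specify which metrics enter Gscore or how they are weighted, so the function name `composite_score`, the metric names, the values, and the weights below are all hypothetical placeholders rather than the actual Gscore definition.

```python
from typing import Mapping


def composite_score(metrics: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Return the weighted sum of metric values.

    Illustrative only: the real Gscore metrics and weights are not
    given in the abstract, so nothing here reflects them.
    """
    # Every metric needs exactly one weight, and vice versa.
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must have the same keys")
    # Require normalized weights so the score stays on the metric scale.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * metrics[name] for name in metrics)


# Hypothetical metric values for one generated answer, each in [0, 1].
example_metrics = {"metric_a": 0.5, "metric_b": 0.25, "metric_c": 1.0}
# Hypothetical weights; they sum to 1.
example_weights = {"metric_a": 0.5, "metric_b": 0.25, "metric_c": 0.25}

score = composite_score(example_metrics, example_weights)
print(score)
```

Requiring the weights to sum to 1 is a design choice of this sketch: it keeps the composite on the same scale as the individual metrics, which makes scores comparable across question types.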