This paper presents CG-Eval, the first comprehensive evaluation of the generation capabilities of Chinese large language models across a wide range of academic disciplines. Models are assessed on their ability to generate accurate and relevant responses to different types of questions in six disciplines: Science and Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical Practitioner Qualification Examination, Judicial Examination, and Certified Public Accountant Examination. This paper also presents Gscore, a composite index computed as a weighted sum of multiple metrics, which measures the quality of a model's generated text against a reference. The test data and test results can be found at http://cgeval.besteasy.com/.
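As a rough illustration of what a composite index built from a weighted sum of metrics looks like, the sketch below combines per-metric scores into a single value. The abstract does not state which metrics or weights Gscore uses, so the function name `composite_score`, the metric names `metric_a` and `metric_b`, and all numeric values here are hypothetical placeholders, not the paper's definition.

```python
from typing import Mapping


def composite_score(metrics: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Return the weighted sum of per-metric scores.

    Metric names and weights are hypothetical; the actual Gscore
    components are not given in the abstract.
    """
    # Require a weight for every metric and vice versa.
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must cover the same names")
    return sum(weights[name] * metrics[name] for name in metrics)


# Placeholder values for illustration only, not taken from the paper.
example_metrics = {"metric_a": 0.5, "metric_b": 1.0}
example_weights = {"metric_a": 0.5, "metric_b": 0.5}
score = composite_score(example_metrics, example_weights)
```

If each metric lies in [0, 1] and the weights sum to 1, the composite also lies in [0, 1], which makes scores comparable across question types.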