We introduce CHARM, the first benchmark for a comprehensive and in-depth evaluation of the commonsense reasoning ability of large language models (LLMs) in Chinese, covering both globally shared and Chinese-specific commonsense. We evaluated 7 English-oriented and 12 Chinese-oriented LLMs on CHARM, employing 5 representative prompt strategies for improving LLMs' reasoning ability, such as Chain-of-Thought. Our findings indicate that the effectiveness of a prompt strategy depends on both the LLM's language orientation and the task's domain, which enriches previous research findings. By building closely interconnected reasoning and memorization tasks, we found that some LLMs struggle with memorizing Chinese commonsense, which in turn limits their reasoning ability, while others show differences in reasoning despite comparable memorization performance. We also evaluated the LLMs' memorization-independent reasoning abilities and analyzed the typical errors. Our study precisely identifies the LLMs' strengths and weaknesses, providing a clear direction for optimization, and can also serve as a reference for studies in other fields. We will release CHARM at https://github.com/opendatalab/CHARM .