We introduce CHARM, the first benchmark for comprehensive and in-depth evaluation of the commonsense reasoning ability of large language models (LLMs) in Chinese, covering both globally shared and Chinese-specific commonsense. We evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5 representative prompt strategies for improving LLMs' reasoning ability, such as Chain-of-Thought. Our findings indicate that the effectiveness of a prompt strategy depends on both the LLM's language orientation and the task's domain, which enriches previous research findings. We built closely interconnected reasoning and memorization tasks and found that some LLMs struggle to memorize Chinese commonsense, which limits their reasoning ability, while others show differences in reasoning despite comparable memorization performance. We also evaluated the LLMs' memorization-independent reasoning abilities and analyzed the typical errors. Our study precisely identifies the LLMs' strengths and weaknesses, providing a clear direction for optimization, and can also serve as a reference for studies in other fields. We will release CHARM at https://github.com/opendatalab/CHARM .