As ChatGPT and GPT-4 spearhead the development of Large Language Models (LLMs), a growing number of researchers are investigating their performance across various tasks. However, the interpretability of LLMs, that is, their ability to generate reasons after giving an answer, remains underexplored. Existing explanation datasets consist mostly of English-language general-knowledge questions, which limits their thematic and linguistic diversity. To address the language bias and the lack of medical resources among rationale-generation QA datasets, we present ExplainCPE (over 7k instances), a challenging medical benchmark in Simplified Chinese. We analyze the errors made by ChatGPT and GPT-4 and point out the limitations of current LLMs in text understanding and computational reasoning. In our experiments, we also find that different LLMs have different preferences for in-context learning. ExplainCPE poses a significant challenge, leaves ample room for further investigation, and can be used to evaluate a model's ability to generate explanations. AI safety and trustworthiness need more attention, and this work takes the first step toward exploring the medical interpretability of LLMs. The dataset is available at https://github.com/HITsz-TMG/ExplainCPE.