ExplainCPE: A Free-text Explanation Benchmark of Chinese Pharmacist Examination

As ChatGPT and GPT-4 spearhead the development of Large Language Models (LLMs), more researchers are investigating their performance across various tasks. However, the interpretability of LLMs, that is, their ability to generate a rationale after giving an answer, remains under-studied. Existing explanation datasets consist mostly of English-language general-knowledge questions, which leaves them short on thematic and linguistic diversity. To address the language bias and the scarcity of medical resources among rationale-generating QA datasets, we present ExplainCPE (over 7k instances), a challenging medical benchmark in Simplified Chinese. We analyzed the errors of ChatGPT and GPT-4, pointing out the limitations of current LLMs in text understanding and computational reasoning. During the experiments, we also found that different LLMs have different preferences for in-context learning. ExplainCPE presents a significant challenge, but it leaves promising room for further investigation and can be used to evaluate a model's ability to generate explanations. AI safety and trustworthiness need more attention, and this work takes the first step in exploring the medical interpretability of LLMs. The dataset is available at https://github.com/HITsz-TMG/ExplainCPE.
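As a rough illustration of what evaluating a model on a benchmark of this kind could involve, the sketch below scores multiple-choice answers by accuracy and compares a generated explanation with a reference explanation using a character-level longest-common-subsequence F1. Everything here is an assumption made for illustration: the `Item` fields, the function names, and the choice of an LCS-based overlap score are hypothetical and are not taken from the abstract or from the ExplainCPE repository.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical schema; the real ExplainCPE files may use other field names.
    question: str
    options: dict[str, str]
    answer: str        # gold option label, e.g. "A"
    explanation: str   # gold free-text explanation


def accuracy(predicted: list[str], gold: list[Item]) -> float:
    """Fraction of items whose predicted option label matches the gold label."""
    if not gold:
        return 0.0
    hits = sum(p == g.answer for p, g in zip(predicted, gold))
    return hits / len(gold)


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (characters)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 between a generated and a reference explanation.

    Assumed stand-in for an explanation-quality metric; character-level
    matching is used so that it also works on unsegmented Chinese text.
    """
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Accuracy only checks the chosen option, while the overlap score looks at the explanation text, so the two numbers can diverge: a model may pick the right option and still justify it poorly.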


Related content

GPT-4

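In the early hours of March 15, 2023 (Beijing time), OpenAI, the developer of ChatGPT, released GPT-4, a new multimodal pretrained large model that is more reliable and more creative, can handle more fine-grained instructions, and can generate content from both image and text prompts. Specifically, GPT-4 is a major leap over the previous generation of models: it accepts both image and text input and has strong image-recognition ability; it greatly raises the text input limit, so that in ChatGPT mode it can process more than 25,000 words of text and follow more detailed instructions; and its answer accuracy has improved significantly.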
