In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset designed for benchmarking large language models (LLMs) on fine-grained clinical specialties. S-MedQA consists of over 24k examples covering 15 medical specialties, with QA pairs that can carry multiple specialty annotations, e.g., when a question is cross-disciplinary. The dataset is constructed with both machine and expert verification to maximize data availability and reliability. We use S-MedQA to investigate the role of clinical specialties in the knowledge-intensive scenario of medical QA. Our results show that training on data from a clinical specialty does not necessarily lead to the best performance on that specialty. Additionally, regardless of the specialty the LLM was fine-tuned on, token probabilities of clinically relevant terms consistently increase across all specialties. Based on these findings, we hypothesize that the gains, at least in our settings, derive primarily from domain shift (e.g., from general to medical text) rather than from the injection of specialty-specific knowledge, suggesting a need to rethink the role of fine-tuning data in the medical domain. To encourage further advancements in clinical NLP, we release S-MedQA along with all the code required to reproduce our experiments to the research community.