emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information

Machine Reading Comprehension (MRC) plays a pivotal role in Medical Question Answering Systems (QAS) and in how medical information is accessed and applied. However, challenges inherent to the medical field, such as complex terminology and ambiguous questions, call for dedicated solutions. One key approach is to integrate specialized medical datasets and to create datasets dedicated to the task, which improves QAS accuracy and supports clinical decision-making and medical research. To address the intricacies of medical terminology, a specialized dataset was integrated: a novel span-extraction dataset derived from emrQA and restructured into 163,695 questions and 4,136 manually obtained answers. This new dataset is called emrQA-msquad. Additionally, to handle ambiguous questions, a dedicated medical dataset for the span-extraction task was introduced, reinforcing the system's robustness. Fine-tuning BERT, RoBERTa, and Tiny RoBERTa for medical contexts raised the share of responses with an F1 score between 0.75 and 1.00 from 10.1% to 37.4%, from 18.7% to 44.7%, and from 16.0% to 46.8%, respectively. The emrQA-msquad dataset is publicly available at https://huggingface.co/datasets/Eladio/emrqa-msquad.
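
To make the span-extraction setup and the reported F1 range concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes that records follow the field layout commonly used for SQuAD V2.0 data (`context`, `question`, `answers` with `text` and `answer_start`) and that F1 is the SQuAD-style token-overlap score; the abstract confirms neither detail. The example record, `token_f1`, and `share_in_range` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Hypothetical record in a SQuAD V2.0-style layout. The real emrQA-msquad
# field names and contents are assumed here, not taken from the abstract.
example = {
    "context": "The patient was started on metoprolol for hypertension.",
    "question": "What medication was started for hypertension?",
    "answers": {"text": ["metoprolol"], "answer_start": [27]},
}


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted span and a reference span.

    Simplified: lowercases and splits on whitespace only (assumption; the
    official SQuAD script also strips punctuation and articles).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match (the "no answer" case); otherwise 0.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def share_in_range(scores, low=0.75, high=1.00):
    """Fraction of per-question F1 scores that fall inside [low, high]."""
    if not scores:
        return 0.0
    return sum(low <= s <= high for s in scores) / len(scores)
```

Under these assumptions, a figure such as "37.4%" would correspond to `share_in_range(scores)` computed over the per-question F1 scores of a fine-tuned model.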
