Automated medical image analysis systems often require large amounts of training data with high-quality labels, which are difficult and time-consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and includes 35,705 images added to PMC since 2018. It further provides manually curated concepts for imaging modalities, with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using the Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can be used for pre-training medical domain models and for evaluating deep learning models for multi-task learning.