Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality (color fundus photography, optical coherence tomography, retinal imaging, or other) and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a [email protected] of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
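For readers unfamiliar with the figure-extraction metric, the following is a minimal sketch of how an intersection-over-union (IoU) score between a predicted and a ground-truth bounding box, and a median over several such scores, could be computed. The box convention `(x_min, y_min, x_max, y_max)`, the function names `box_iou` and `median_iou`, and the example coordinates are illustrative assumptions, not taken from the released pipeline.

```python
from statistics import median
from typing import List, Tuple

# Assumed box convention (hypothetical): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped to zero when the boxes are disjoint
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield an IoU of 0 instead of dividing by zero
    return inter / union if union > 0 else 0.0


def median_iou(predicted: List[Box], ground_truth: List[Box]) -> float:
    """Median IoU over already-matched (predicted, ground-truth) box pairs."""
    return median(box_iou(p, g) for p, g in zip(predicted, ground_truth))


if __name__ == "__main__":
    # Made-up coordinates, for illustration only
    preds = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 20.0, 20.0)]
    truth = [(1.0, 1.0, 3.0, 3.0), (10.0, 10.0, 20.0, 20.0)]
    print(median_iou(preds, truth))
```

An IoU of 1.0 means the extracted region coincides exactly with the annotated one, so a median close to 1 indicates that at least half of the extracted figures match their annotations almost perfectly.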