PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
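As an illustration of how word-level bounding boxes can support coordinate-grounded lookup, the following is a minimal sketch rather than the released schema. The field names (`words`, `text`, `bbox`), the `[x0, y0, x1, y1]` pixel convention, the sample values, and the helper `words_in_region` are all assumptions made for this example.

```python
import json

# Hypothetical page record. The field names ("words", "text", "bbox") and the
# [x0, y0, x1, y1] pixel convention are assumptions, not the released schema.
PAGE_JSON = """
{
  "page": 1,
  "words": [
    {"text": "Gene",       "bbox": [100, 50, 150, 70]},
    {"text": "expression", "bbox": [160, 50, 280, 70]},
    {"text": "Methods",    "bbox": [100, 400, 190, 420]}
  ]
}
"""


def words_in_region(page, region):
    """Return the texts of words whose box centre lies inside `region`.

    `region` is (x0, y0, x1, y1) in the same coordinate system as the boxes.
    Words are returned in the order they appear in the record.
    """
    rx0, ry0, rx1, ry1 = region
    hits = []
    for word in page["words"]:
        x0, y0, x1, y1 = word["bbox"]
        # Use the box centre so that words straddling the border are
        # assigned to exactly one side.
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            hits.append(word["text"])
    return hits


page = json.loads(PAGE_JSON)
print(" ".join(words_in_region(page, (0, 0, 600, 100))))  # prints "Gene expression"
```

Testing the box centre instead of full containment is a design choice: it keeps words that slightly overlap the region border, which is useful when OCR boxes are a few pixels loose.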