Despite the excitement around biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data, the foundation of modern AI systems, remains a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To ease access to a dataset of this scale, we provide scalable streaming and search APIs through a web server, facilitating integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all of our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.