A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Alejandro Lozano, Min Woo Sun, James Burgess, Jeffrey J. Nirschl, Christopher Polzak, Yuhui Zhang, Liangyu Chen, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Collin Chiu, Orr Zohar, Xiaohan Wang, Alfred Seunghoon Song, Chiang Chia-Chun, Robert Tibshirani, Serena Yeung-Levy

Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

