Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging because medical data are complex and heterogeneous. To address this challenge, we introduce HoneyBee, a scalable, modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates multiple data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It combines data preprocessing techniques with foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format and made accessible through Hugging Face datasets and PyTorch dataloaders, while vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments that assess the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
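To make the retrieval step concrete, the following is a minimal sketch of similarity search over stored embeddings. It is not HoneyBee's API: the function names `cosine_similarity` and `top_k`, the record keys, and the three-dimensional toy vectors are all hypothetical, and a plain in-memory dictionary stands in for a vector database. It assumes only that each record has already been mapped to a fixed-length embedding.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], store: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Rank every stored embedding against the query and keep the best k."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones would come from a foundation model
# and typically have hundreds or thousands of dimensions.
toy_store = {
    "pathology_report_A": [0.9, 0.1, 0.0],
    "radiology_image_B": [0.0, 1.0, 0.0],
    "clinical_note_C": [0.8, 0.2, 0.1],
}

results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

This exhaustive scan costs time linear in the number of stored records; vector databases usually replace it with an approximate nearest-neighbor index so that queries stay fast as the dataset grows.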