Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable, modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates diverse data modalities, including clinical records, imaging data, and patient outcomes. It employs data preprocessing techniques and transformer-based architectures to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility, while vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of the embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.
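To make the retrieval workflow concrete, the sketch below illustrates the kind of nearest-neighbor query over stored patient embeddings that a vector database performs. This is a minimal toy example, not HoneyBee's actual API: the `nearest_neighbors` function, patient IDs, and three-dimensional vectors are all hypothetical, and real embeddings would come from the transformer-based foundation models described above.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, store, k=2):
    # Rank stored patient embeddings by similarity to the query embedding
    # and return the IDs of the top-k matches.
    scored = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [patient_id for patient_id, _ in scored[:k]]

# Toy embedding store keyed by patient ID; a production vector database
# (e.g. one holding HoneyBee embeddings) would index millions of these
# with approximate nearest-neighbor search instead of a linear scan.
store = {
    "patient_A": [0.9, 0.1, 0.0],
    "patient_B": [0.1, 0.9, 0.2],
    "patient_C": [0.8, 0.2, 0.1],
}

print(nearest_neighbors([1.0, 0.0, 0.0], store, k=2))
# → ['patient_A', 'patient_C']
```

A dedicated vector database replaces the linear scan with an approximate index, but the query semantics (embed, then rank by similarity) are the same.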