This paper presents FlowCyt, the first comprehensive benchmark for multi-class single-cell classification in flow cytometry data. The dataset comprises bone marrow samples from 30 patients, with each cell characterized by twelve markers. Ground-truth labels identify five hematological cell types: T lymphocytes, B lymphocytes, monocytes, mast cells, and hematopoietic stem/progenitor cells (HSPCs). Experiments use supervised inductive learning and semi-supervised transductive learning on up to 1 million cells per patient. Baseline methods include Gaussian Mixture Models, XGBoost, Random Forests, Deep Neural Networks, and Graph Neural Networks (GNNs). Among these baselines, GNNs perform best, exploiting spatial relationships in the graph-encoded data. The benchmark enables standardized evaluation of clinically relevant classification tasks, along with exploratory analyses that give insight into hematological cell phenotypes. FlowCyt is the first public flow cytometry benchmark with a richly annotated, heterogeneous dataset, and it is intended to support the development and rigorous assessment of novel methods for single-cell analysis.
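The abstract does not say how cells are encoded as a graph. As a minimal sketch of one common choice, assuming a k-nearest-neighbor graph in the twelve-dimensional marker space, the following builds such a graph and applies a single round of neighborhood averaging, the basic operation that message-passing GNNs build on. The function names, the value of `k`, and the synthetic marker values are all hypothetical and are not taken from FlowCyt.

```python
import math
import random


def knn_graph(cells, k):
    """Return, for each cell, the indices of its k nearest cells (Euclidean distance)."""
    neighbors = []
    for i, a in enumerate(cells):
        # Distance from cell i to every other cell; brute force, fine for a toy example.
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(cells) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def mean_aggregate(cells, neighbors):
    """Replace each cell's features with the mean over itself and its neighbors."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [cells[i]] + [cells[j] for j in nbrs]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


# Synthetic stand-in for one patient: 50 cells, 12 marker intensities each (assumed values).
random.seed(0)
N_MARKERS = 12
cells = [[random.gauss(0.0, 1.0) for _ in range(N_MARKERS)] for _ in range(50)]

graph = knn_graph(cells, k=5)
smoothed = mean_aggregate(cells, graph)
```

A brute-force neighbor search like this scales quadratically with the number of cells, so at the scale the abstract mentions (up to 1 million cells per patient) an approximate nearest-neighbor index would likely be needed instead.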