Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpretability, and cost-effective profiling. However, most existing feature selection approaches either operate as multi-stage pipelines or rely on post hoc feature attribution, making selection and prediction weakly coupled. In this work, we present YOTO (you only train once), an end-to-end framework that jointly identifies discrete gene subsets and performs prediction within a single differentiable architecture. In our model, the prediction task directly guides which genes are selected, while the learned subsets, in turn, shape the predictive representation. This closed feedback loop enables the model to iteratively refine both what it selects and how it predicts during training. Unlike existing approaches, YOTO enforces sparsity so that only the selected genes contribute to inference, eliminating the need to train additional downstream classifiers. Through a multi-task learning design, the model learns shared representations across related objectives, allowing partially labeled datasets to inform one another, and discovering gene subsets that generalize across tasks without additional training steps. We evaluate YOTO on two representative single-cell RNA-seq datasets, showing that it consistently outperforms state-of-the-art baselines. These results demonstrate that sparse, end-to-end, multi-task gene subset selection improves predictive performance and yields compact and meaningful gene subsets, advancing biomarker discovery and single-cell analysis.
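The coupling described above, in which a discrete gene mask gates the input so that only selected genes reach a shared predictor with several task heads, can be illustrated with a small forward-pass sketch. This is not the YOTO implementation: the top-k gating rule, the layer sizes, the task names (`cell_type`, `condition`), and every function name below are assumptions made purely for illustration. The sketch covers the forward pass only; training such a model end to end would additionally need a gradient estimator for the discrete mask (for example a straight-through estimator), which the abstract does not specify.

```python
import math
import random

def top_k_mask(scores, k):
    """Hard 0/1 mask keeping the k highest-scoring genes (assumed gating rule)."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0.0] * len(scores)
    for i in keep:
        mask[i] = 1.0
    return mask

def init_layer(rng, n_out, n_in):
    """Small random weights (one row per output unit) and zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, weights, bias):
    """Dense layer: dot product of each weight row with x, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, scores, k, encoder, heads):
    """Gate the input with the mask, encode once, then apply one head per task."""
    mask = top_k_mask(scores, k)
    gated = [v * m for v, m in zip(x, mask)]  # unselected genes are zeroed out
    hidden = [max(0.0, h) for h in linear(gated, *encoder)]  # shared ReLU layer
    return {task: softmax(linear(hidden, *params)) for task, params in heads.items()}

def multitask_loss(probs, labels):
    """Mean cross-entropy over the tasks that have a label; None marks a missing label."""
    losses = [-math.log(probs[t][y]) for t, y in labels.items() if y is not None]
    return sum(losses) / len(losses) if losses else 0.0

# Toy setup: 8 genes, keep 3, two hypothetical tasks.
rng = random.Random(0)
n_genes, k, n_hidden = 8, 3, 4
scores = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]  # stand-in for learned gene scores
encoder = init_layer(rng, n_hidden, n_genes)
heads = {
    "cell_type": init_layer(rng, 3, n_hidden),
    "condition": init_layer(rng, 2, n_hidden),
}

x = [rng.random() for _ in range(n_genes)]  # one cell's expression vector
probs = forward(x, scores, k, encoder, heads)

# This cell has a cell-type label but no condition label (partially labeled data).
loss = multitask_loss(probs, {"cell_type": 1, "condition": None})
```

Because the mask multiplies the input before the shared layer, changing the expression of an unselected gene leaves every task's prediction unchanged, which is the sense in which only the selected genes contribute to inference. Skipping missing labels in the loss is one simple way for partially labeled datasets to share the same encoder.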