Large genomic and imaging datasets can be used to train models that learn meaningful representations of cellular systems. Across domains, model performance improves predictably with dataset size and compute budget, providing a basis for allocating data and computation. Scientific data, however, is also limited by noise arising from factors such as molecular undersampling, sequencing errors, and limited image resolution. By fitting 1,670 representation learning models across three data modalities (gene expression, sequence, and image data), we show that noise defines a distinct scaling axis, with performance improving predictably as noise decreases. This noise scaling follows a logarithmic law. We derive the law from a model of noise propagation and use it to define noise sensitivity and model capacity as benchmarking metrics. We show that protein sequence representations are noise-robust while single-cell transcriptomics models are not, with a Transformer-based model showing greater noise robustness but lower saturating performance than a variational autoencoder model. Noise scaling metrics may support future model evaluation and experimental design.
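The logarithmic form suggests a simple fitting recipe. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a parameterization P(sigma) = C - beta * log(1 + sigma/sigma0), in which C stands in for model capacity (saturating low-noise performance) and beta for noise sensitivity; the function name, parameter values, and synthetic data are hypothetical.

```python
# Minimal sketch: fitting a logarithmic noise-scaling law to benchmark scores.
# The parameterization P(sigma) = C - beta * log(1 + sigma / sigma0) is an
# assumption for illustration; C plays the role of model capacity (zero-noise
# saturating performance) and beta the role of noise sensitivity.
import numpy as np
from scipy.optimize import curve_fit

def noise_scaling(sigma, C, beta, sigma0):
    """Assumed logarithmic noise-scaling law (hypothetical form)."""
    return C - beta * np.log1p(sigma / sigma0)

# Synthetic example: representation quality measured at increasing noise levels.
rng = np.random.default_rng(0)
sigma = np.logspace(-2, 1, 20)                       # noise levels
perf = noise_scaling(sigma, C=0.90, beta=0.08, sigma0=0.05)
perf = perf + rng.normal(scale=0.005, size=sigma.size)  # measurement jitter

# Fit the three parameters; p0 supplies a rough starting point.
(C_hat, beta_hat, sigma0_hat), _ = curve_fit(
    noise_scaling, sigma, perf, p0=(1.0, 0.1, 0.1)
)
print(f"capacity C = {C_hat:.3f}, noise sensitivity beta = {beta_hat:.3f}")
```

Under this assumed form, comparing two models reduces to comparing their fitted (C, beta) pairs: a Transformer-like model with smaller beta but smaller C would match the abstract's description of greater noise robustness at lower saturating performance.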