Recent benchmarks reveal that models for single-cell perturbation response are often outperformed by simply predicting the dataset mean. We trace this anomaly to a metric artifact: control-referenced deltas and unweighted error metrics reward mode collapse whenever the control is biased or the biological signal is sparse. Large-scale \textit{in silico} simulations and analysis of two real-world perturbation datasets confirm that shared reference shifts, not genuine biological change, drive high performance in these evaluations. We introduce two differentially expressed gene (DEG)-aware metrics, weighted mean-squared error (WMSE) and weighted delta $R^{2}$ ($R^{2}_{w}(\Delta)$) with respect to all perturbations, which measure error in niche signals with high sensitivity. We further introduce negative and positive performance baselines to calibrate these metrics. With these improvements, the mean baseline sinks to null performance while genuine predictors are correctly rewarded. Finally, we show that using WMSE as a loss function reduces mode collapse and improves model performance.
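The artifact described above can be reproduced in a small toy simulation. The sketch below is illustrative only and is not the implementation used in this work: the simulation design, the function names (`simulate`, `wmse`, `r2_delta`), the uniform weighting over a known DEG mask, and the choice of the mean over all perturbations as the delta reference for $R^{2}_{w}(\Delta)$ are all assumptions made for the example.

```python
import numpy as np

def simulate(n_perts=50, n_genes=500, n_deg=10, effect=3.0, seed=0):
    """Toy data: an offset from control shared by all perturbations,
    plus sparse, balanced DEG effects (n_deg must be even)."""
    rng = np.random.default_rng(seed)
    control = rng.normal(0.0, 1.0, n_genes)
    shared = rng.normal(0.0, 1.0, n_genes)      # shared reference shift (assumed)
    truth = np.tile(control + shared, (n_perts, 1))
    deg = np.zeros((n_perts, n_genes), dtype=bool)
    signs = np.repeat([1.0, -1.0], n_deg // 2)  # half up-, half down-regulated
    for p in range(n_perts):
        idx = rng.choice(n_genes, size=n_deg, replace=False)
        deg[p, idx] = True
        truth[p, idx] += effect * signs
    return control, truth, deg

def wmse(pred, truth, w):
    """Weighted MSE per perturbation (each row of w sums to 1), then averaged."""
    return float(np.mean(np.sum(w * (pred - truth) ** 2, axis=1)))

def r2_delta(pred, truth, ref, w):
    """Weighted R^2 on deltas taken against `ref`, averaged over perturbations."""
    dp, dt = pred - ref, truth - ref
    mu = np.sum(w * dt, axis=1, keepdims=True)  # weighted mean of the true delta
    ss_res = np.sum(w * (dt - dp) ** 2, axis=1)
    ss_tot = np.sum(w * (dt - mu) ** 2, axis=1)
    return float(np.mean(1.0 - ss_res / ss_tot))

control, truth, deg = simulate()
n_perts, n_genes = truth.shape

# Two predictors: the dataset-mean baseline and a noisy but genuine predictor.
pert_mean = truth.mean(axis=0)
mean_pred = np.tile(pert_mean, (n_perts, 1))
noisy_pred = truth + np.random.default_rng(1).normal(0.0, 0.5, truth.shape)

uniform = np.full(truth.shape, 1.0 / n_genes)   # unweighted metrics
deg_w = deg / deg.sum(axis=1, keepdims=True)    # hypothetical DEG-aware weights

results = {}
for name, pred in [("mean", mean_pred), ("noisy", noisy_pred)]:
    results[name] = {
        "mse": wmse(pred, truth, uniform),
        "wmse": wmse(pred, truth, deg_w),
        "r2_ctrl": r2_delta(pred, truth, control, uniform),
        "r2_w": r2_delta(pred, truth, pert_mean, deg_w),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

In this toy setting the mean baseline attains a lower unweighted MSE than the noisy predictor and a high control-referenced delta $R^{2}$, because both quantities are dominated by the shared shift. Under the DEG-weighted variants the ordering reverses, and the mean baseline's weighted delta $R^{2}$ cannot exceed zero, since its predicted delta against the perturbation mean is identically zero.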