Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, as well as distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.
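To make the risk-coverage formulation concrete, the following is a minimal sketch rather than the implementation evaluated in this work. It assumes each prediction comes with a scalar confidence score in [0, 1] and a binary indicator of retrieval success. The function names (`risk_coverage_curve`, `aurc`, `select_threshold`) are hypothetical. The risk-control step uses a plain Hoeffding bound with a union bound over a fixed threshold grid, chosen here only as a simple stand-in for the generalization bounds mentioned above.

```python
import numpy as np


def risk_coverage_curve(confidence, correct):
    """Coverage and selective risk when accepting the k most confident predictions."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Sort from most to least confident.
    order = np.argsort(-confidence, kind="stable")
    errors = 1.0 - correct[order]
    k = np.arange(1, len(errors) + 1)
    coverage = k / len(errors)
    # Selective risk: error rate among the k accepted predictions.
    risk = np.cumsum(errors) / k
    return coverage, risk


def aurc(confidence, correct):
    """Area under the risk-coverage curve (mean selective risk over all k)."""
    _, risk = risk_coverage_curve(confidence, correct)
    return float(np.mean(risk))


def select_threshold(confidence, correct, target_risk, delta=0.05, grid=None):
    """Lowest grid threshold whose upper-bounded selective risk is <= target_risk.

    Assumption: Hoeffding bound on the accepted calibration points, with a
    union bound over the fixed grid, so all bounds hold jointly with
    probability at least 1 - delta. Returns None if no threshold qualifies.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    grid = np.sort(np.asarray(grid, dtype=float))
    for t in grid:  # ascending: the first valid threshold has the highest coverage
        accepted = confidence >= t
        k = int(accepted.sum())
        if k == 0:
            continue
        empirical_risk = 1.0 - correct[accepted].mean()
        slack = np.sqrt(np.log(len(grid) / delta) / (2 * k))
        if empirical_risk + slack <= target_risk:
            return float(t)
    return None
```

In use, the threshold would be chosen on a held-out calibration set and then applied to new spectra: predictions with confidence below the threshold are abstained on. The Hoeffding bound is loose at low coverage, so tighter bounds would generally retain more annotations at the same target risk.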