One of the central challenges in the computational analysis of liquid chromatography-tandem mass spectrometry (LC-MS/MS) data is identifying the compounds underlying the output spectra. In recent years, this problem has increasingly been tackled with deep learning methods. A common strategy is to predict a molecular fingerprint vector from an input mass spectrum and then use that vector to search for matches in a chemical compound database. While various loss functions are employed in training these predictive models, their impact on model performance remains poorly understood. In this study, we investigate commonly used loss functions and derive novel regret bounds that characterize when the Bayes-optimal decisions for these objectives must diverge. Our results reveal a fundamental trade-off between two objectives: (1) fingerprint similarity and (2) molecular retrieval. Optimizing for more accurate fingerprint predictions typically worsens retrieval results, and vice versa. Our theoretical analysis shows that this trade-off depends on the similarity structure of the candidate sets, providing guidance for loss function and fingerprint selection.
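The predict-then-retrieve strategy described above can be illustrated with a minimal sketch. It assumes binary fingerprints, a 0.5 binarization threshold, and Tanimoto similarity as the ranking score; none of these choices is specified here, and the function names, candidate names, and toy values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero fingerprints are treated as identical (an assumption).
    return both / either if either else 1.0


def rank_candidates(predicted_probs, candidates, threshold=0.5):
    """Binarize predicted bit probabilities, then rank candidates by similarity."""
    predicted = [1 if p >= threshold else 0 for p in predicted_probs]
    scored = [(name, tanimoto(predicted, fp)) for name, fp in candidates.items()]
    # Highest similarity first; ties keep insertion order (sorted is stable).
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy model output: per-bit probabilities for a 6-bit fingerprint (hypothetical).
predicted_probs = [0.9, 0.8, 0.2, 0.6, 0.1, 0.3]

# Toy candidate set standing in for a compound database query result (hypothetical).
candidates = {
    "candidate_A": [1, 1, 0, 1, 0, 0],
    "candidate_B": [1, 1, 0, 0, 0, 1],
    "candidate_C": [0, 0, 1, 0, 1, 1],
}

ranking = rank_candidates(predicted_probs, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this sketch the retrieval outcome depends only on how the predicted fingerprint ranks relative to the candidates, not on its bitwise accuracy as such, which is one informal way to see why the two objectives can differ.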