Tandem mass spectrometry (MS/MS) is central to small-molecule identification, but current deep learning systems for spectrum prediction remain difficult to evaluate and deploy in practice. While novel architectures routinely claim state-of-the-art performance, inconsistent metadata conditioning and entangled preprocessing pipelines hinder fair architectural comparisons. Moreover, existing evaluations are often restricted to curated datasets and fail to capture the heterogeneity and cross-domain shifts of real-world metabolomics. Furthermore, current benchmarks lack difficulty-aware diagnostics, leaving practitioners blind to how models behave under specific compute or data constraints. To address these gaps, we present FlexMS, a modular public-data benchmark framework that standardizes MS/MS prediction across public resources while keeping molecular encoders, metadata conditioning, predictor heads, and downstream retrieval under one protocol. FlexMS establishes a fair evaluation playground that significantly lowers the barrier to integrating new predictive tools. Rather than optimizing solely for average scores, FlexMS augments aggregate accuracy with difficulty-aware diagnostics, providing actionable guidance on model selection across different compute constraints, data scales, and downstream retrieval objectives. Ultimately, FlexMS provides the community with a reproducible standard for identifying which algorithmic conclusions are stable and which operating points are most viable in practice.