Small molecules play a critical role in the biomedical, environmental, and agrochemical domains, each with distinct physicochemical requirements and success criteria. Although biomedical research benefits from extensive datasets and established benchmarks, agrochemical data remain scarce, particularly for species-specific toxicity. This work focuses on ApisTox, the most comprehensive dataset of experimentally validated chemical toxicity to the honey bee (Apis mellifera), an ecologically vital pollinator. We evaluate a diverse suite of machine learning approaches on ApisTox, including molecular fingerprints, graph kernels, graph neural networks, and pretrained models. Comparative analysis with medicinal datasets from the MoleculeNet benchmark reveals that ApisTox represents a distinct chemical space. Performance degradation on non-medicinal datasets such as ApisTox demonstrates the limited generalizability of current state-of-the-art algorithms trained solely on biomedical data. Our study highlights the need for more diverse datasets and for targeted model development geared toward the agrochemical domain.