Toxicity remains a leading cause of failure in early-stage drug development. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, i.e., generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. Informed by expert toxicological knowledge, we design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput chain for evaluating repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metric design, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware molecular editing.