AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM (Deep Molecular Language Modeling), a dual-view framework that grounds high-resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high-frequency evidence from 1024 $\times$ 1024 inputs, encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints, and fuses the visual and geometric streams with cross-attention, enabling physically grounded generation without explicit atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while remaining competitive with specialist methods. It produces valid numeric outputs for all property queries and attains an MAE of 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models. Code is available at https://github.com/1anj/DeepMoLM.
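The dual-view fusion described above can be illustrated with a minimal sketch: visual patch tokens attend over embedded fingerprint tokens, and the attended geometric context is added back to the visual stream. All dimensions, token counts, and the residual-style combination below are illustrative assumptions, not the paper's actual architecture; a numpy single-head attention stands in for the model's learned cross-attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention.

    queries     -- visual tokens, shape (n_img, d)
    keys_values -- geometric (fingerprint) tokens, shape (n_fp, d)
    Returns one attended geometric context vector per visual token.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_img, n_fp)
    weights = softmax(scores, axis=-1)              # rows sum to 1
    return weights @ keys_values                    # (n_img, d)

rng = np.random.default_rng(0)
d = 64                                              # hypothetical embedding width
visual_tokens = rng.standard_normal((256, d))       # e.g. patch embeddings of a 1024x1024 image
geom_tokens = rng.standard_normal((32, d))          # e.g. embedded E3FP fingerprint tokens
fused = visual_tokens + cross_attention(visual_tokens, geom_tokens)
print(fused.shape)
```

Each visual token thus receives a geometry-conditioned update without the model ever consuming raw atom coordinates, which is the property the abstract highlights.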