We introduce elEmBERT, a model for chemical classification tasks. It is a deep learning model built on a multilayer encoder architecture. We demonstrate the approach on sets of organic, inorganic, and crystalline compounds. In particular, we developed and tested the model on the Matbench and MoleculeNet benchmarks, which cover crystal properties and drug-design tasks. We also analyze the vector representations of chemical compounds, shedding light on the underlying patterns in structural data. The model exhibits strong predictive capabilities and proves applicable to both molecular and materials datasets. For instance, on the Tox21 dataset, we achieved an average precision of 96%, surpassing the previous best result by 10%.