Algebraic data types (ADTs) let a representation specify, at the type level, which molecular values are valid and which transformations are meaningful. We propose a molecular representation as a family of typed ADTs that separates (i) constitution (Dietz-style bonding systems), (ii) 3D coordinates and stereochemistry, and (iii) electronic-structure annotations. This separation makes invariants explicit, supports deterministic local edits, and provides hooks for symmetry-aware and Bayesian modeling. The data structures also let us examine how the representation constrains the operations that can be performed on it: types make invalid manipulations unrepresentable and make it easier to define meaningful priors and likelihoods for generative models (programs with sample and score operations). Unlike string-based formats, the ADT exposes chemical structure directly; validity conditions (e.g., valence and symmetry constraints) can be enforced by construction and checked deterministically during transformations. Electronic-structure annotations (shell, subshell, and orbital metadata) can optionally be attached to atoms when such information is available; we do not attempt to compute orbitals in this work. We sketch Bayesian probabilistic programming through an integration with LazyPPL, a lazy probabilistic programming library. Molecules can be made instances of a group under rotation, which supports geometric learning settings where molecular properties are invariant under rigid motions and relabellings. We demonstrate the framework's flexibility through an extension that represents chemical reactions. We provide a Haskell library implementing the representation, released under an OSI-approved open-source license and archived with a DOI.
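To make the three-way separation concrete, the following is a minimal Haskell sketch and not the library's actual API. Every name in it (`AtomId`, `BondingSystem`, `mkMolecule`, `rotateZ`, and so on) is hypothetical. The bonding system is a deliberate simplification of the Dietz idea (a set of atoms sharing a number of electrons), and the valence table is a toy assumption covering four elements. It illustrates two points from the text: validity checked deterministically by a smart constructor, and a rigid rotation that touches coordinates only and leaves constitution unchanged.

```haskell
-- Illustrative sketch only: all names are hypothetical, not the library's API.

newtype AtomId = AtomId Int deriving (Eq, Ord, Show)

data Element = H | C | N | O deriving (Eq, Show)

-- (ii) Geometry: Cartesian coordinates.
data Coord = Coord Double Double Double deriving (Eq, Show)

-- (iii) Optional electronic-structure annotation (metadata only, never computed).
data Shell = Shell { principalN :: Int, occupancy :: Int } deriving (Eq, Show)

data Atom = Atom
  { atomId  :: AtomId
  , element :: Element
  , coord   :: Coord
  , shells  :: Maybe [Shell]
  } deriving (Eq, Show)

-- (i) Constitution: a simplified Dietz-style bonding system, i.e. a set of
-- atoms over which a number of electrons is shared.
data BondingSystem = BondingSystem
  { members         :: [AtomId]
  , sharedElectrons :: Int
  } deriving (Eq, Show)

data Molecule = Molecule
  { atoms   :: [Atom]
  , bonding :: [BondingSystem]
  } deriving (Eq, Show)

-- Toy valence table (an assumption made for this sketch).
maxValence :: Element -> Int
maxValence H = 1
maxValence C = 4
maxValence N = 3
maxValence O = 2

-- Smart constructor: the only intended way to build a Molecule. A real module
-- would hide the raw Molecule constructor behind an export list.
mkMolecule :: [Atom] -> [BondingSystem] -> Either String Molecule
mkMolecule as bs
  | any malformed bs  = Left "bonding system needs at least two known atoms"
  | any overValent as = Left "valence exceeded"
  | otherwise         = Right (Molecule as bs)
  where
    ids          = map atomId as
    malformed b  = length (members b) < 2 || any (`notElem` ids) (members b)
    -- Electrons an atom contributes to one system, split evenly over members.
    share b      = sharedElectrons b `div` length (members b)
    used a       = sum [share b | b <- bs, atomId a `elem` members b]
    overValent a = used a > maxValence (element a)

-- Rigid rotation about the z axis: only the geometric layer changes.
rotateZ :: Double -> Molecule -> Molecule
rotateZ t m = m { atoms = map turn (atoms m) }
  where
    turn a = a { coord = rot (coord a) }
    rot (Coord x y z) =
      Coord (x * cos t - y * sin t) (x * sin t + y * cos t) z

-- Example value: water, with approximate coordinates in angstroms.
water :: Either String Molecule
water = mkMolecule
  [ Atom (AtomId 0) O (Coord 0 0 0) Nothing
  , Atom (AtomId 1) H (Coord 0.96 0 0) Nothing
  , Atom (AtomId 2) H (Coord (-0.24) 0.93 0) Nothing
  ]
  [ BondingSystem [AtomId 0, AtomId 1] 2
  , BondingSystem [AtomId 0, AtomId 2] 2
  ]
```

Loaded in GHCi, `water` evaluates to a `Right` value, whereas giving a hydrogen atom two bonds makes `mkMolecule` return `Left "valence exceeded"`. Because `rotateZ` rewrites only the `coord` field, `bonding (rotateZ t m) == bonding m` holds for any angle `t`, which is the kind of invariant the layered design is meant to make explicit.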