We demonstrate that the assembly pathway method underlying ``Assembly Theory'' (AT) is a suboptimal, restricted version of Huffman's encoding (Shannon-Fano type) for `counting copies,' the stated objective of the authors of AT, introduced in computer science in the 1960s and widely used by popular statistical and computable compression algorithms that have previously been applied to all sorts of biosignatures. We show how simple modular instructions can mislead AT, causing it to fail at what its authors originally intended (counting the `number of copies') and to miss subtleties beyond very trivial statistical properties of biological systems. We present cases whose low complexity diverges arbitrarily from their random-like appearance, to which AT would assign arbitrarily high statistical significance, and show that AT fails in simple cases (synthetic or natural) on which it was supposed to shed light. Our theoretical and empirical results imply that the assembly index, whose computable nature is not an advantage, offers no substantial improvement over existing concepts and methods, whether computable or (semi-)uncomputable. No strong compression or algorithmic complexity results were required to prove that AT and MA are ill-defined and underperform compared with simple coding schemes. We show that, despite claims based on experimental data, the assembly measure is driven mostly or entirely by InChI codes, which had already been reported to discriminate organic from inorganic compounds using other indexes.
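As a purely illustrative aside on what `counting copies' means for a frequency-based code, the following minimal Python sketch builds a textbook Huffman code and reports the encoded length of a sequence. It is not the construction used in the paper and does not compute the assembly index; the helper names (`huffman_code_lengths`, `encoded_bits`, `blocks`) and the toy inputs are assumptions introduced here.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code
    built from the empirical frequencies of `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tiebreaker, {symbol: depth}); the unique
    # tiebreaker ensures the dicts themselves are never compared.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(symbols):
    """Total bits needed to encode `symbols` with its own Huffman code
    (the cost of storing the code table is ignored)."""
    lengths = huffman_code_lengths(symbols)
    counts = Counter(symbols)
    return sum(counts[s] * lengths[s] for s in counts)


def blocks(text, size):
    """Split `text` into consecutive blocks of `size` characters
    (a hypothetical way of choosing the repeated units)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


if __name__ == "__main__":
    # Frequent symbols get shorter codes: a -> 1 bit, b and c -> 2 bits.
    print(huffman_code_lengths("aaaabbc"))
    # Eight copies of one block cost one bit per copy.
    print(encoded_bits(blocks("ABRA" * 8, 4)))
```

In this sketch, the more often a block is repeated, the shorter its codeword, so the encoded length directly reflects how many copies of each block occur.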