This paper investigates the application of the transformer architecture in protein folding, as exemplified by DeepMind's AlphaFold project, and its implications for our understanding of so-called large language models. The prevailing discourse often assumes a ready-made analogy between proteins, encoded as sequences of amino acids, and natural language, which we term the language paradigm of computational (structural) biology. Rather than take this analogy as given, we critically evaluate it to assess the kind of knowledge-making afforded by the transformer architecture. We first trace the analogy's emergence and historical development, delineating the influence of structural linguistics on structural biology beginning in the mid-20th century. We then examine three often overlooked preprocessing steps essential to the transformer architecture, namely subword tokenization, word embedding, and positional encoding, to demonstrate its regime of representation based on continuous, high-dimensional vector spaces, one that departs from the discrete nature of language. The successful deployment of transformers in protein folding, we argue, discloses a non-linguistic approach to token processing intrinsic to the architecture. Through this non-linguistic processing, we contend, the transformer architecture carves out unique epistemological territory and produces a new class of knowledge, distinct from established domains. Our search for intelligent machines, then, has to begin with the shape, rather than the place, of intelligence. Consequently, the emerging field of critical AI studies should take methodological inspiration from the history of science as it seeks to conceptualize the contributions of artificial intelligence to knowledge-making, within and beyond the domain-specific sciences.
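To make the three preprocessing steps concrete, the sketch below illustrates them in miniature. It is a minimal illustration, not AlphaFold's or any particular library's pipeline: the five-piece vocabulary, the greedy longest-match tokenizer, and all names are invented for this example, the embedding table is randomly initialized, and the positional encoding follows the sinusoidal scheme of Vaswani et al. (2017).

```python
# A minimal sketch of the three preprocessing steps discussed above:
# subword tokenization, embedding lookup, and sinusoidal positional
# encoding. The vocabulary and names are illustrative assumptions.
import numpy as np

# 1. Subword tokenization: a toy greedy longest-match segmenter standing
#    in for BPE/WordPiece-style algorithms.
vocab = {"fold": 0, "ing": 1, "pro": 2, "tein": 3, "un": 4}

def tokenize(word: str) -> list[int]:
    """Greedily match the longest known subword at each position."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                ids.append(vocab[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no subword covers {word[i:]!r}")
    return ids

# 2. Word embedding: each discrete token id indexes a row of a continuous,
#    high-dimensional matrix -- the move from symbols to vector spaces.
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

# 3. Sinusoidal positional encoding (Vaswani et al., 2017), added so the
#    otherwise order-blind attention layers can distinguish positions.
def positional_encoding(n_positions: int, d: int) -> np.ndarray:
    pos = np.arange(n_positions)[:, None]
    dim = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (dim // 2)) / d)
    return np.where(dim % 2 == 0, np.sin(angles), np.cos(angles))

ids = tokenize("unfolding")                          # -> [4, 0, 1]
x = embedding_table[ids] + positional_encoding(len(ids), d_model)
print(ids, x.shape)                                  # (3, 8)
```

The last line makes the paper's point visible: whatever enters the transformer, whether word pieces or amino-acid residues, has by this stage already become continuous, high-dimensional vectors rather than discrete symbols.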