This paper investigates the application of the transformer architecture to protein folding, as exemplified by DeepMind's AlphaFold project, and its implications for understanding large language models as models of language. The prevailing discourse often assumes a ready-made analogy between proteins, encoded as sequences of amino acids, and natural language, encoded as sequences of discrete symbols. Rather than taking the linguistic structure of proteins as given, we critically evaluate this analogy to assess the kind of knowledge-making afforded by the transformer architecture. We first trace the analogy's emergence and historical development, delineating the influence of structural linguistics on structural biology beginning in the mid-20th century. We then examine three often overlooked pre-processing steps essential to the transformer architecture (subword tokenization, word embedding, and positional encoding) to demonstrate its regime of representation based on continuous, high-dimensional vector spaces, which departs from the discrete, semantically demarcated symbols of language. The successful deployment of transformers in protein folding, we argue, discloses what we consider a non-linguistic approach to token processing intrinsic to the architecture. Through this non-linguistic processing, we contend, the transformer architecture carves out unique epistemological territory and produces a new class of knowledge, distinct from established domains. Our search for intelligent machines, we suggest, must therefore begin with the shape, rather than the place, of intelligence. Consequently, the emerging field of critical AI studies should take methodological inspiration from the history of science in its quest to conceptualize the contributions of artificial intelligence to knowledge-making, within and beyond the domain-specific sciences.
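To make the claim about representation concrete, the following is a minimal sketch, assuming a toy single-letter amino-acid vocabulary and NumPy, of how the three named pre-processing steps turn a discrete sequence into the continuous, high-dimensional vectors a transformer actually operates on. The sequence, vocabulary size, model dimension, and embedding matrix are illustrative assumptions (the embedding is randomly initialised rather than learned, and the "subword" tokenizer is reduced to character-level lookup); only the sinusoidal positional encoding follows the formulation of Vaswani et al. (2017). Nothing here reproduces AlphaFold's actual pipeline.

```python
# Illustrative sketch only: discrete amino-acid symbols -> continuous vectors.
import numpy as np

rng = np.random.default_rng(0)

# 1. Tokenization (toy version): real subword tokenizers such as BPE merge
#    frequent substrings; here each amino-acid letter is simply its own token.
vocab = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}  # 20 standard amino acids
sequence = "MKTAYIAK"                                           # hypothetical fragment
token_ids = [vocab[aa] for aa in sequence]

# 2. Embedding: each discrete id is exchanged for a continuous vector taken
#    from an embedding matrix (randomly initialised here, learned in practice).
d_model = 16
embedding_matrix = rng.normal(size=(len(vocab), d_model))
embeddings = embedding_matrix[token_ids]            # shape: (seq_len, d_model)

# 3. Positional encoding: sinusoids of different frequencies added per position,
#    following the scheme of the original transformer paper.
positions = np.arange(len(sequence))[:, None]       # (seq_len, 1)
dims = np.arange(d_model)[None, :]                  # (1, d_model)
angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
angles = positions * angle_rates
pos_encoding = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

# The transformer never sees discrete symbols, only these summed vectors.
transformer_input = embeddings + pos_encoding
print(transformer_input.shape)                      # (8, 16)
```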