The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet the evolutionary relationships among them, which arise through fine-tuning, distillation, or adaptation, are often undocumented or unclear, complicating LLM management. Existing methods for recovering these relationships are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of a model's functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties, and we establish that such a DNA exists. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA agrees with prior studies conducted on limited subsets of models and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct an evolutionary tree of LLMs using phylogenetic algorithms; the resulting tree aligns with the shift from encoder-decoder to decoder-only architectures, reflects temporal progression, and reveals distinct evolutionary speeds across LLM families.
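For readers unfamiliar with the term, the standard bi-Lipschitz condition can be written as below. This is a sketch of the textbook definition only: the symbols $\mathcal{F}$ (a space of LLMs viewed as functions), $d_{\mathcal{F}}$ (a distance between their functional behaviors), $\varphi$ (the DNA map), $k$ (the DNA dimension), and the constants $c_1, c_2$ are assumed here for illustration and may not match the notation used in the paper.

```latex
% Standard bi-Lipschitz condition; all symbols are illustrative assumptions.
% \varphi : (\mathcal{F}, d_{\mathcal{F}}) \to \mathbb{R}^{k} is bi-Lipschitz if
% there exist constants 0 < c_1 \le c_2 such that, for all f, g \in \mathcal{F},
\[
  c_1 \, d_{\mathcal{F}}(f, g)
  \;\le\;
  \lVert \varphi(f) - \varphi(g) \rVert
  \;\le\;
  c_2 \, d_{\mathcal{F}}(f, g).
\]
```

Read informally, the upper bound says that two models with similar functional behavior receive nearby DNA vectors, and the lower bound says that models with clearly different behavior cannot collapse to the same DNA. Under this reading, distances between DNA vectors can serve as a proxy for behavioral distance, up to the constants $c_1$ and $c_2$.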