Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, and its complexity and recursive depth far exceed what humans can trace manually. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct from indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four LLM releases with rich public artifacts, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
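The three ingredients of the formalization (direct versus indirect dependencies, operation-centered relationships, and artifact identity resolution) can be illustrated with a minimal sketch. This is not ModSleuth's implementation: the class `DependencyGraph`, the helper `canonical`, the alias table, the operation labels, and every artifact name below are hypothetical and chosen only for illustration.

```python
# Hypothetical alias table: maps surface names found in documentation
# to one canonical artifact identifier. All entries are invented.
ALIASES = {
    "model-a": "org/model-a-v1",
    "Model A (v1)": "org/model-a-v1",
}


def canonical(name):
    """Resolve a surface name to its canonical identifier (fallback: itself)."""
    return ALIASES.get(name, name)


class DependencyGraph:
    """Toy graph whose edges carry the operation linking two artifacts."""

    def __init__(self):
        # artifact -> set of (operation, upstream artifact)
        self.edges = {}

    def add(self, artifact, operation, upstream):
        key = canonical(artifact)
        self.edges.setdefault(key, set()).add((operation, canonical(upstream)))

    def direct(self, artifact):
        """Upstream artifacts one hop away."""
        return {up for _, up in self.edges.get(canonical(artifact), set())}

    def indirect(self, artifact):
        """Upstream artifacts reachable only through an intermediate artifact."""
        start = canonical(artifact)
        direct = self.direct(start)
        seen = set()
        stack = list(direct)
        while stack:
            node = stack.pop()
            for up in self.direct(node):
                if up not in seen:
                    seen.add(up)
                    stack.append(up)
        return seen - direct - {start}


g = DependencyGraph()
g.add("student-model", "trained_on", "synthetic-set")
g.add("synthetic-set", "generated_by", "Model A (v1)")
g.add("model-a", "trained_on", "web-corpus")  # alias of the same artifact

print(g.direct("student-model"))
print(sorted(g.indirect("student-model")))
```

In this toy graph, the two differently named references to the hypothetical "Model A" collapse into one node, so the dependency on `web-corpus` becomes visible from `student-model` only after identity resolution. A multi-hop license obligation would surface in the same way, as a property of an indirect dependency.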