Inference in large-scale AI models is typically performed on dense parameter matrices, leading to inference cost and system complexity that scale unsustainably with model size. This limitation does not arise from insufficient model capacity, but from treating post-training inference systems as monolithic operators while ignoring internal structures formed during learning. We show that gradient update events in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from their initialization distribution after training. As a result, post-training inference systems are structurally non-uniform and inherently decomposable. Based on this observation, we introduce a post-training statistical criterion and a structural annealing procedure that removes unsupported dependencies and reveals stable, independent substructures. This work establishes a post-training, model-agnostic structural view of inference systems and enables structured, parallel inference without modifying model functionality or interfaces.
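The abstract does not specify the statistical criterion itself. As a purely hypothetical sketch, one way to operationalize "statistically indistinguishable from the initialization distribution" is a two-sample Kolmogorov–Smirnov test per parameter group, pruning groups whose trained values have not measurably moved away from a sample of their initializer. All function names, the per-group granularity, and the threshold below are assumptions for illustration, not the paper's method:

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        ca = bisect.bisect_right(a, x) / len(a)
        cb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(ca - cb))
    return d

def prune_unmoved(init_groups, trained_groups, threshold=0.15):
    """Hypothetical criterion: zero out parameter groups whose trained
    values are statistically indistinguishable from their initialization
    sample (small KS statistic); keep groups that clearly moved."""
    result = []
    for init_w, trained_w in zip(init_groups, trained_groups):
        if ks_statistic(init_w, trained_w) < threshold:
            result.append([0.0] * len(trained_w))  # unsupported dependency: drop
        else:
            result.append(list(trained_w))         # learned structure: keep
    return result
```

Under this sketch, a group of weights drawn unchanged from the initializer yields a small KS statistic and is removed, while a group whose distribution shifted during training is retained; the threshold would in practice be set from the KS null distribution at a chosen significance level.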