Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and improves cache reuse, delivering up to a 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated into SGLang and paired with a kernel scheduler, DeepFusionKernel delivers consistent speedups across generation lengths while remaining adaptable to diverse models, inference configurations, and hardware platforms.
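For context, the sketch below shows a minimal unfused SwiGLU MLP in PyTorch, the baseline computation that a fused kernel targets; all names and shapes (`swiglu_mlp`, `d_model=4096`, `d_ff=11008`) are illustrative assumptions, not taken from the paper or from DeepFusionKernel itself.

```python
# Minimal unfused SwiGLU MLP reference (illustrative sketch, not the
# paper's kernel). Each of the three projections streams its full weight
# matrix from HBM, which is the memory-bandwidth cost fusion aims to cut.
import torch
import torch.nn.functional as F

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Unfused SwiGLU MLP: three GEMMs plus an elementwise SiLU gate."""
    gate = F.silu(x @ w_gate)    # [*, d_ff] gated branch with SiLU activation
    up = x @ w_up                # [*, d_ff] linear branch
    return (gate * up) @ w_down  # [*, d_model] down projection

# Illustrative shapes: 4 tokens, d_model=4096, d_ff=11008 (Llama-7B-like)
x = torch.randn(4, 4096)
w_gate = torch.randn(4096, 11008)
w_up = torch.randn(4096, 11008)
w_down = torch.randn(11008, 4096)
y = swiglu_mlp(x, w_gate, w_up, w_down)
assert y.shape == (4, 4096)
```

In this unfused form, the intermediate `gate` and `up` activations round-trip through memory between kernel launches; a deeply fused kernel can keep them in on-chip storage and read each weight matrix once, which is where the claimed HBM-traffic reduction comes from.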