Large Language Model (LLM) inference is rapidly becoming a core datacenter service, yet current serving stacks keep the host CPU on the critical path for orchestration and token-level control. This makes LLM performance sensitive to CPU interference, undermining application colocation and forcing operators to reserve CPU headroom, leaving substantial capacity unutilized. We introduce Blink, an end-to-end serving architecture that removes the host CPU from the steady-state inference path by redistributing responsibilities across a SmartNIC and a GPU. Blink offloads request handling to the SmartNIC, which delivers inputs directly into GPU memory via RDMA, and replaces host-driven scheduling with a persistent GPU kernel that performs batching, scheduling, and KV-cache management without CPU involvement. Evaluated against TensorRT-LLM, vLLM, and SGLang, Blink outperforms all baselines even in isolation, reducing pre-saturation P99 time to first token (TTFT) by up to 8.47$\times$ and P99 time per output token (TPOT) by up to 3.40$\times$, improving decode throughput by up to 2.1$\times$, and reducing energy per token by up to 48.6$\%$. Under CPU interference, Blink maintains stable performance, while existing systems degrade by up to two orders of magnitude.