Context retrieval systems for LLM inference face a critical challenge: high retrieval latency forces a choice between waiting for the complete context (poor time-to-first-token, TTFT) and proceeding without it (reduced output quality). Streaming context incrementally, so that retrieval overlaps with inference, can hide this latency, but doing so with concurrent requests introduces new challenges: requests contend for GPU compute and memory, and scheduling must adapt to context that arrives dynamically. We present Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. Stream2LLM introduces adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). It decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses longest common prefix matching to minimize redundant computation when the input changes dynamically. To evaluate Stream2LLM, we collect two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that the streaming architecture improves TTFT by up to 11x, with cost-aware scheduling providing critical benefits under memory pressure, all while maintaining throughput parity with non-streaming baselines. Code: https://github.com/rajveerb/stream2llm/tree/mlsys_artifact
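The longest common prefix matching mentioned above can be illustrated with a minimal sketch. This is not Stream2LLM's implementation: the function names, the `block_size` parameter, and the assumption that cached state is reused only at block-aligned boundaries are all hypothetical, chosen to show how append-mode and update-mode inputs would differ in how much cached prefill work survives.

```python
from typing import Dict, Sequence


def longest_common_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Return the number of leading token IDs shared by both sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached: Sequence[int], incoming: Sequence[int],
                 block_size: int = 16) -> Dict[str, int]:
    """Split an updated input into reusable and to-be-recomputed tokens.

    Assumption (hypothetical): cached state is kept in fixed-size blocks,
    so only whole blocks inside the common prefix can be reused.
    """
    lcp = longest_common_prefix_len(cached, incoming)
    reusable = (lcp // block_size) * block_size  # round down to a block boundary
    return {
        "reuse_tokens": reusable,
        "invalidate_tokens": len(cached) - reusable,
        "recompute_tokens": len(incoming) - reusable,
    }


# Append-mode: new context extends the old input, so nothing is invalidated.
append_plan = plan_prefill([1, 2, 3, 4], [1, 2, 3, 4, 5, 6], block_size=2)

# Update-mode: the input diverges partway through, so the tail is invalidated.
update_plan = plan_prefill([1, 2, 3, 4, 5, 6], [1, 2, 3, 9, 9], block_size=2)
```

In this sketch, append-mode inputs keep the whole cached prefix, while update-mode inputs lose everything from the first block that contains a differing token.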