Model Serving Papers - 专知

Model Serving

FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

Arxiv

0+ reads · Feb 18

OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

Arxiv

0+ reads · Feb 12

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

Arxiv

0+ reads · Feb 16

LLM Serving Optimization with Variable Prefill and Decode Lengths

Arxiv

0+ reads · Feb 10

MUSE: Multi-Tenant Model Serving With Seamless Model Updates

Arxiv

0+ reads · Feb 12

PlanetServe: A Decentralized, Scalable, and Privacy-Preserving Overlay for Democratizing Large Language Model Serving

Arxiv

0+ reads · Feb 13

DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

Arxiv

0+ reads · Feb 6

PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models

Arxiv

0+ reads · Feb 12

REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

Arxiv

0+ reads · Feb 4

Towards Resiliency in Large Language Model Serving with KevlarFlow

Arxiv

0+ reads · Jan 30

A Universal Load Balancing Principle and Its Application to Large Language Model Serving

Arxiv

0+ reads · Feb 1

Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving

Arxiv

0+ reads · Jan 14

A Universal Load Balancing Principle and Its Application to Large Language Model Serving

Arxiv

0+ reads · Jan 25

WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching

Arxiv

0+ reads · Jan 15

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

Arxiv

0+ reads · Jan 19
