STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming

Researchers and practitioners working with large language models face a fragmented landscape: local models are free and private but hardware limits the model size and context windows a researcher can use; institutional HPC centers offer powerful GPU resources at no marginal cost and keep data within institutional boundaries, but operate behind firewalls and are designed for batch jobs rather than interactive use; commercial cloud APIs provide frontier-model quality on demand but impose significant cost and data retention policies unsuitable for sensitive research data. No existing system unifies all three. STREAM (Smart Tiered Routing Engine for AI Models) addresses this gap with four contributions: (1) a three-tier routing architecture combining local, HPC, and cloud inference with a local LLM-based complexity judge; (2) a dual-channel HPC streaming architecture that separates the Globus Compute control plane (authentication and job dispatch) from a WebSocket relay data plane (token delivery), enabling sub-second TTFT (0.54 s median, 21.1x over batch mode's 11.40 s) through institutional firewalls without VPN or firewall rule changes, with end-to-end AES-256-GCM encryption ensuring the relay operator cannot read token payloads; (3) tier-aware context summarization that prevents long conversations from forcing simple queries onto expensive tiers; and (4) an HPC-as-API proxy mode that exposes HPC inference as an OpenAI-compatible endpoint callable from any standard client with no HPC expertise, a deployment pattern made practical only by the sub-second TTFT of contribution (2). Llama 3.2 3B achieves 85.1% free-tier retention on a 1,200-query benchmark spanning ten domains. Measured TTFT: 0.26 s local, 0.54 s HPC (relay), 1.68 s cloud.

翻译：从事大语言模型研究与应用的学者及从业者正面临碎片化的技术格局：本地模型虽免费且私密，但硬件性能限制了可用的模型规模与上下文窗口；机构级HPC中心提供零边际成本的高性能GPU资源，并将数据保留在机构内部，但其运行于防火墙之后且为批处理作业而非交互式使用设计；商业云API可按需提供前沿模型质量，但成本高昂且数据保留策略不适用于敏感研究数据。现有系统无法统一这三类资源。STREAM（面向AI模型的智能分级路由引擎）通过四项贡献弥合了这一差距：（1）三级路由架构，整合本地、HPC与云端推理，并采用基于本地LLM的复杂度评估器；（2）双通道HPC流式架构，将Globus Compute控制平面（认证与作业分发）与WebSocket中继数据平面（令牌投递）分离，使TTFT（首令牌时间）达亚秒级（中位数0.54秒，较批处理模式的11.40秒提升21.1倍），且无需修改VPN或防火墙规则即可穿越机构防火墙，端到端AES-256-GCM加密确保中继操作员无法读取令牌载荷；（3）层级感知的上下文摘要机制，防止长对话将简单查询推送至昂贵层级；（4）HPC即API代理模式，将HPC推理暴露为兼容OpenAI的端点，无需HPC专业知识即可从任意标准客户端调用——该部署模式仅当第（2）项贡献实现亚秒级TTFT后方具实用性。Llama 3.2 3B模型在跨十个领域的1200条查询基准测试中实现85.1%的免费层级保留率。实测TTFT：本地0.26秒，HPC（中继）0.54秒，云端1.68秒。