We present the Federated Inference Resource Scheduling Toolkit (FIRST), a framework enabling Inference-as-a-Service across distributed High-Performance Computing (HPC) clusters. FIRST provides cloud-like access to diverse AI models, such as Large Language Models (LLMs), on existing HPC infrastructure. Leveraging Globus Auth and Globus Compute, the system allows researchers to run parallel inference workloads via an OpenAI-compliant API in private, secure environments. This cluster-agnostic API distributes requests across federated clusters, targeting any of the numerous hosted models. FIRST supports multiple inference backends (e.g., vLLM), auto-scales resources, maintains "hot" nodes for low-latency execution, and offers both high-throughput batch and interactive modes. The framework addresses the growing demand for private, secure, and scalable AI inference in scientific workflows, allowing researchers to generate billions of tokens daily on-premises without relying on commercial cloud infrastructure.
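Because FIRST exposes an OpenAI-compliant API, a standard chat-completions payload can target any model hosted across the federation through one schema. A minimal sketch follows; the endpoint URL and model name are hypothetical placeholders, not part of FIRST itself, and a real deployment would also attach a Globus Auth access token.

```python
import json

# Hypothetical cluster-agnostic endpoint; a real FIRST deployment
# supplies its own URL and Globus Auth credentials.
ENDPOINT = "https://first.example.org/v1/chat/completions"

# Standard OpenAI chat-completions request body; the "model" field
# selects among the federation's hosted models (name is illustrative).
payload = {
    "model": "meta-llama/Llama-3-70B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize this experiment log."}
    ],
    "max_tokens": 256,
}

# The wire format is plain JSON, identical to what any
# OpenAI-compatible client or backend (e.g., vLLM) expects.
body = json.dumps(payload)
print(sorted(payload.keys()))
```

Because the schema is unchanged from the OpenAI API, existing client libraries and tooling can point at a FIRST endpoint without modification.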