The widespread adoption of large language models (LLMs) has created a pressing need for an efficient, secure, and private serving infrastructure that allows researchers to run open-source or custom fine-tuned LLMs while assuring users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs are well suited for training LLMs, their batch scheduling paradigm is not designed to support real-time serving of AI applications. Cloud systems, on the other hand, are well suited for web services but commonly lack access to the computational power of HPC clusters, especially the expensive and scarce high-end GPUs required for optimal inference speed. We propose an architecture, with an accompanying implementation, consisting of a web service that runs on a cloud VM with secure access to a scalable backend running a multitude of LLMs on HPC systems. By offering a web service backed by our HPC infrastructure to host LLMs, we leverage the trusted environment of local universities and research centers to provide a private and secure alternative to commercial LLM services. Our solution integrates natively with the HPC batch scheduler Slurm, enabling seamless deployment on HPC clusters, and runs side by side with regular Slurm workloads while utilizing gaps in the schedule created by Slurm. To ensure the security of the HPC system, we use the SSH ForceCommand directive to construct a robust circuit breaker, which prevents successful attacks on the web-facing server from affecting the cluster. We have successfully deployed our system as a production service and made the source code available at \url{https://github.com/gwdg/chat-ai}.
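A circuit breaker of the kind described above can be sketched as an sshd_config fragment on the cluster's login host. This is a minimal illustration of the ForceCommand mechanism only; the user name and handler path are assumptions for the example, not the project's actual configuration:

```
# Hypothetical sshd_config fragment on the HPC login node.
# Connections from the web-facing gateway user can never obtain a shell;
# every session is forced through a single fixed handler script, so a
# compromised web server cannot run arbitrary commands on the cluster.
Match User llm-gateway
    ForceCommand /usr/local/bin/llm-request-handler
    PermitTTY no
    AllowTcpForwarding no
    X11Forwarding no
    PermitTunnel no
```

Because ForceCommand overrides whatever command the client requests (the original request remains visible to the handler via the SSH_ORIGINAL_COMMAND environment variable), the handler can validate and narrowly dispatch inference requests while rejecting everything else.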