The increasing adoption of large language models (LLMs) has created a pressing need for efficient, secure, and private serving infrastructure that allows researchers to run open-source or custom fine-tuned LLMs while assuring users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs are well suited for training LLMs, their batch scheduling paradigm is not designed to support real-time serving of AI applications. Cloud systems, on the other hand, are well suited for web services but commonly lack access to the computational power of clusters, especially the expensive and scarce high-end GPUs required for optimal inference speed. We propose an architecture, with an accompanying implementation, consisting of a web service that runs on a cloud VM with secure access to a scalable backend running a multitude of AI models on HPC systems. By offering a web service backed by our HPC infrastructure to host LLMs, we leverage the trusted environment of local universities and research centers to provide a private and secure alternative to commercial LLM services. Our solution integrates natively with Slurm, enabling seamless deployment on HPC clusters, and runs side by side with regular Slurm workloads by exploiting gaps in the Slurm schedule. To ensure the security of the HPC system, we use the SSH ForceCommand directive to construct a robust circuit breaker, which prevents successful attacks on the web-facing server from affecting the cluster. We have successfully deployed our system as a production service and made the source code available at https://github.com/gwdg/chat-ai.
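The ForceCommand-based circuit breaker can be illustrated with a minimal sshd_config fragment. This is a sketch only; the user name and dispatcher path are hypothetical, and the actual configuration ships with the linked repository:

```
# On the HPC login node: confine the key used by the web-facing VM
# to a single vetted dispatcher script, regardless of what command
# the client requests. "chat-ai-gateway" and the script path below
# are illustrative placeholders.
Match User chat-ai-gateway
    ForceCommand /opt/chat-ai/bin/dispatch-inference-request
    PermitTTY no
    AllowTcpForwarding no
    X11Forwarding no
```

With such a configuration, even a fully compromised web server can only invoke the fixed dispatcher, which validates and forwards inference requests; it gains no interactive shell or arbitrary command execution on the cluster.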