This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt), performance, and hardware scalability against NVIDIA A100 GPUs (in 4x and 8x configurations) within the National Research Platform (NRP) ecosystem. A total of 12 open-source LLMs, ranging from 124 million to 70 billion parameters, are served using the vLLM framework. Our analysis reveals that QAic achieves competitive energy efficiency, with advantages on specific models, while enabling more granular hardware allocation: some 70B models run on a single QAic card, versus the 8 A100 GPUs otherwise required, at roughly 20x lower power consumption (148 W vs. 2,983 W). For smaller models, a single QAic device draws up to 35x less power than our 4-GPU A100 configuration (36 W vs. 1,246 W). These findings offer insight into the potential of the Qualcomm Cloud AI 100 Ultra for energy-constrained and resource-efficient HPC deployments within the NRP.
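For concreteness, the following is a minimal sketch, assuming vLLM's offline Python API, of how a model can be served and a throughput-per-watt figure derived; the model name, prompt batch, and power value are illustrative placeholders, not the study's actual benchmark harness or telemetry.

```python
import time
from vllm import LLM, SamplingParams

# Load one of the served open-source models (name is illustrative).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=128)
prompts = ["Summarize the benefits of energy-efficient LLM inference."] * 32

start = time.time()
outputs = llm.generate(prompts, params)  # batched offline generation
elapsed = time.time() - start

# Throughput per watt: generated tokens per second divided by average
# device power; the power value below is a placeholder, not a measurement.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
throughput = generated / elapsed            # tokens/s
avg_power_w = 148.0                         # illustrative average draw in watts
print(f"{throughput:.1f} tok/s -> {throughput / avg_power_w:.2f} tok/s/W")
```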