The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for accelerating LLM inference. The LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency. It is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for the 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. Synthesized in a Samsung 4nm process, the LPU has a total area of 0.824 mm² and a power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x better energy efficiency than NVIDIA H100 and L4 servers, respectively.
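As a quick sanity check on the reported figures, the sketch below converts the stated per-token latencies into generation throughput and derives the implied GPU baseline latencies from the quoted speedups. The formulas (throughput = 1000 / latency_ms; baseline = latency_ms × speedup) are plain arithmetic over numbers taken from the abstract, not part of HyperAccel's methodology.

```python
# Derived arithmetic only; the input numbers come from the abstract above.
reported = {
    "1.3B": {"lpu_ms_per_token": 1.25, "speedup_vs_gpu": 2.09},
    "66B":  {"lpu_ms_per_token": 20.9, "speedup_vs_gpu": 1.37},
}

for model, r in reported.items():
    throughput = 1000.0 / r["lpu_ms_per_token"]           # tokens generated per second
    gpu_ms = r["lpu_ms_per_token"] * r["speedup_vs_gpu"]   # implied GPU per-token latency
    print(f"{model}: LPU {r['lpu_ms_per_token']} ms/token "
          f"({throughput:.0f} tokens/s), implied GPU {gpu_ms:.2f} ms/token")
# 1.3B: LPU 1.25 ms/token (800 tokens/s), implied GPU 2.61 ms/token
# 66B: LPU 20.9 ms/token (48 tokens/s), implied GPU 28.63 ms/token
```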