The boom in Large Language Models (LLMs) like GPT-4 and ChatGPT has marked a significant advancement in artificial intelligence. These models are becoming increasingly powerful, and increasingly complex to train and serve. This growth in capabilities comes with a substantial increase in computational requirements, both in hardware resources and in energy consumption. The goal of this paper is to showcase how hardware and software co-design can be used to create customized hardware systems for specific LLM workloads. We propose a simulation workflow that combines model parallelism techniques with a multi-accelerator simulation framework to evaluate efficiency metrics. We focus on inference workloads and report power, cycle-count, and latency metrics obtained from a design space exploration over multiple software and hardware configurations.