As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. GPGPU-based architectures offer versatility for diverse AI workloads but often fall short in efficiency and cost-effectiveness. Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend to broader applications or to adapt to the rapidly evolving AI landscape. M100 is Li Auto's response: a performant, cost-effective architecture for AI inference in Autonomous Driving (AD), Large Language Models (LLMs), and intelligent human interaction, domains crucial to today's most competitive automobile platforms. M100 employs a dataflow parallel architecture in which compiler-architecture co-design orchestrates not only computation but, more critically, data movement across time and space. By exploiting the efficiency of dataflow computing, this hardware-software co-design improves system performance while reducing hardware complexity and cost. M100 largely eliminates caching: tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on-chip and off-chip memories, which yields greater efficiency and scalability than cache-based systems. A second key principle is selecting the right operational granularity for scheduling, issuing, and execution across compiler, firmware, and hardware; recognizing the commonalities among AI workloads, we chose the tensor as the fundamental data element. M100 demonstrates general AI computing capability across diverse inference applications, including UniAD (for AD) and LLaMA (for LLMs). Benchmarks show that M100 outperforms GPGPU architectures in AD applications while achieving higher utilization, which suggests a promising direction for future general AI computing.