M100: An Orchestrated Dataflow Architecture Powering General AI Computing

Yan Xie, Changkui Mao, Changsong Wu, Chao Lu, Chao Suo, Cheng Qian, Chun Yang, Danyang Zhu, Hengchang Xiong, Hongzhan Lu, Hongzhen Liu, Jiafu Liu, Jie Chen, Jie Dai, Junfeng Tang, Kai Liu, Kun Li, Lipeng Ge, Meng Sun, Min Luo, Peng Chen, Peng Wang, Shaodong Yang, Shibin Tang, Shibo Chen, Weikang Zhang, Xiao Ling, Xiaobo Du, Xin Wu, Yang Liu, Yi Jiang, Yihua Jin, Yin Huang, Yuli Zhang, Zhen Yuan, Zhiyuan Man, Zhongxiao Yao

From arXiv. Accepted to appear at ISCA 2026 Industry Track. 12 pages, 16 figures.

As deep learning-based AI technologies gain momentum, the demand for general-purpose AI computing architectures continues to grow. While GPGPU-based architectures offer versatility for diverse AI workloads, they often fall short in efficiency and cost-effectiveness. Various Domain-Specific Architectures (DSAs) excel at particular AI tasks but struggle to extend across broader applications or adapt to the rapidly evolving AI landscape. M100 is Li Auto's response: a performant, cost-effective architecture for AI inference in Autonomous Driving (AD), Large Language Models (LLMs), and intelligent human interactions, domains crucial to today's most competitive automobile platforms. M100 employs a dataflow parallel architecture, where compiler-architecture co-design orchestrates not only computation but, more critically, data movement across time and space. Leveraging dataflow computing efficiency, our hardware-software co-design improves system performance while reducing hardware complexity and cost. M100 largely eliminates caching: tensor computations are driven by compiler- and runtime-managed data streams flowing between computing elements and on/off-chip memories, yielding greater efficiency and scalability than cache-based systems. Another key principle was selecting the right operational granularity for scheduling, issuing, and execution across compiler, firmware, and hardware. Recognizing commonalities in AI workloads, we chose the tensor as the fundamental data element. M100 demonstrates general AI computing capability across diverse inference applications, including UniAD (for AD) and LLaMA (for LLMs). Benchmarks show M100 outperforms GPGPU architectures in AD applications with higher utilization, representing a promising direction for future general AI computing.
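The idea of replacing caches with compiler- and runtime-managed tensor streams can be illustrated with a small toy model. The sketch below is only an assumption-laden illustration, not M100's actual compiler, firmware, or hardware: the names `plan_stream`, `run_schedule`, and `TILE` are hypothetical, and the two-buffer scratchpad is a generic double-buffering scheme chosen for clarity. A "compiler" step emits a static schedule of tile-granular loads and computes, and a "runtime" step executes that schedule in order, so data placement is decided ahead of time rather than discovered by cache lookups.

```python
# Toy illustration only: plan_stream, run_schedule, and TILE are hypothetical
# names and do not describe M100's real compiler, firmware, or hardware.

def plan_stream(num_rows, tile):
    """'Compiler' step: emit a static schedule of tile-granular operations.

    Each LOAD moves one row tile from off-chip memory into one of two
    scratchpad buffers; each COMPUTE consumes a buffer loaded earlier.
    The load of tile i+1 is issued before the compute on tile i, which
    models double buffering without any cache lookup.
    """
    starts = list(range(0, num_rows, tile))
    schedule = []
    if starts:
        schedule.append(("LOAD", starts[0], 0))
    for i, start in enumerate(starts):
        if i + 1 < len(starts):
            schedule.append(("LOAD", starts[i + 1], (i + 1) % 2))
        schedule.append(("COMPUTE", start, i % 2))
    return schedule


def run_schedule(schedule, a, b, tile):
    """'Runtime' step: execute the schedule in order, computing a @ b."""
    inner = len(b)
    cols = len(b[0])
    out = [[0] * cols for _ in a]
    scratchpad = [None, None]  # two software-managed on-chip buffers
    for op, start, buf in schedule:
        if op == "LOAD":
            # Explicit data movement: off-chip rows -> scratchpad buffer.
            scratchpad[buf] = a[start:start + tile]
        else:
            # COMPUTE: multiply the resident tile by b and write the result.
            for r, row in enumerate(scratchpad[buf]):
                for c in range(cols):
                    out[start + r][c] = sum(row[k] * b[k][c] for k in range(inner))
    return out


TILE = 2
A = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
B = [[1, 0], [0, 1]]  # identity, so the product should equal A
SCHEDULE = plan_stream(len(A), TILE)
RESULT = run_schedule(SCHEDULE, A, B, TILE)
```

In this toy, the unit of scheduling is a whole tile of a tensor rather than a single scalar or cache line, which loosely mirrors the abstract's point about choosing the tensor as the fundamental element for scheduling, issuing, and execution.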

