AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs

AI accelerator operators are compiled into multi-stage pipeline programs where DMA, vector, matrix, and scalar units execute concurrently on shared on-chip buffers. A missing or misplaced synchronization primitive introduces hardware-visible data races that escape both simulation and golden testing, because neither models the accelerator's cross-unit visibility semantics. We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations -- program order, synchronization order, and barrier order -- and reduce the correctness question to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by happens-before. Here "barrier" denotes an abstract ordering primitive in the model, covering vendor pipe barriers, hard-event synchronization, and equivalent frontend-normalized synchronization points. We prove that barrier sufficiency is decidable in $O(|E|^2)$ time and that our checker is both sound and complete under the modeled semantics. We implement AccelSync, a static verification tool instantiated for Ascend 910B2 and Cambricon MLU370 by changing only the hardware model. On 6,292 production kernels from the CANN operator library, AccelSync identifies 3 previously unknown synchronization hazards -- one matching a hazard class for which we observed nondeterministic outputs on Ascend 910B2 under a specific toolkit/driver configuration (CANN 8.0.RC3), though this observation was not reproducible after a subsequent driver upgrade -- and on 120 LLM-generated kernels it flags a 19.2% defect rate (95% CI: [13.0%, 27.4%]). A mutation study on 688 non-equivalent mutants yields 100% detection, and a head-to-head comparison shows AccelSync detects hazards that Huawei's runtime sanitizer msSanitizer misses, at 400x lower cost per kernel.

翻译：AI加速器算子被编译为多级流水线程序，其中DMA、向量、矩阵和标量单元在共享的片上缓冲区上并发执行。缺失或错放的同步原语会导致硬件可见的数据竞争，这种竞争既无法通过仿真也无法通过黄金测试捕获，因为这两者均未建模加速器的跨单元可见性语义。我们将加速器流水线程序形式化为受限并发语言，定义了一种参数化硬件事件语义（包含三种序关系——程序序、同步序和屏障序），并将正确性问题归约为屏障充分性：同一缓冲区上的每一对跨单元写-读操作是否被先行发生关系所排序。此处"屏障"指模型中抽象的排序原语，涵盖供应商管道屏障、硬事件同步以及等价的前端归一化同步点。我们证明屏障充分性可在$O(|E|^2)$时间内判定，且验证器在建模语义下具有完备性与可靠性。我们实现了AccelSync——一款仅需修改硬件模型即可适配Ascend 910B2和寒武纪MLU370的静态验证工具。在来自CANN算子库的6,292个生产级内核上，AccelSync发现了3个此前未知的同步隐患——其中一个与我们在Ascend 910B2上某特定工具链/驱动配置（CANN 8.0.RC3）下观测到非确定性输出的隐患类别相匹配（尽管该现象在后继驱动升级后未复现）——并在120个LLM生成的内核中标记出19.2%的缺陷率（95%置信区间：[13.0%，27.4%]）。针对688个非等价变异体的变异测试实现了100%检测率，而头对头比较表明，AccelSync能够检测出华为运行时消毒器msSanitizer遗漏的隐患，且每个内核的检测成本降低了400倍。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【普林斯顿博士论文】深度学习加速器的编译器支持：端到端评估与数据访问优化

专知会员服务

19+阅读 · 2025年1月7日

《人工智能异构人工智能加速器统一接口（征求意见稿）》国家标准

专知会员服务

37+阅读 · 2024年5月27日

【NeurIPS2023】MultiModN:多模态，多任务，可解释的模块化网络

专知会员服务

40+阅读 · 2023年9月27日

《分布式多智能体强化学习的编码》加州大学等

专知会员服务

56+阅读 · 2022年11月2日