AI accelerator operators are compiled into multi-stage pipeline programs where DMA, vector, matrix, and scalar units execute concurrently on shared on-chip buffers. A missing or misplaced synchronization primitive introduces hardware-visible data races that escape both simulation and golden testing, because neither models the accelerator's cross-unit visibility semantics. We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations -- program order, synchronization order, and barrier order -- and reduce the correctness question to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by happens-before. Here "barrier" denotes an abstract ordering primitive in the model, covering vendor pipe barriers, hard-event synchronization, and equivalent frontend-normalized synchronization points. We prove that barrier sufficiency is decidable in $O(|E|^2)$ time and that our checker is both sound and complete under the modeled semantics. We implement AccelSync, a static verification tool instantiated for Ascend 910B2 and Cambricon MLU370 by changing only the hardware model. On 6,292 production kernels from the CANN operator library, AccelSync identifies 3 previously unknown synchronization hazards -- one matching a hazard class for which we observed nondeterministic outputs on Ascend 910B2 under a specific toolkit/driver configuration (CANN 8.0.RC3), though this observation was not reproducible after a subsequent driver upgrade -- and on 120 LLM-generated kernels it flags a 19.2% defect rate (95% CI: [13.0%, 27.4%]). A mutation study on 688 non-equivalent mutants yields 100% detection, and a head-to-head comparison shows AccelSync detects hazards that Huawei's runtime sanitizer msSanitizer misses, at 400x lower cost per kernel.
翻译:AI加速器算子被编译为多级流水线程序,其中DMA、向量、矩阵和标量单元在共享的片上缓冲区上并发执行。缺失或错放的同步原语会导致硬件可见的数据竞争,这种竞争既无法通过仿真也无法通过黄金测试捕获,因为这两者均未建模加速器的跨单元可见性语义。我们将加速器流水线程序形式化为受限并发语言,定义了一种参数化硬件事件语义(包含三种序关系——程序序、同步序和屏障序),并将正确性问题归约为屏障充分性:同一缓冲区上的每一对跨单元写-读操作是否被先行发生关系所排序。此处"屏障"指模型中抽象的排序原语,涵盖供应商管道屏障、硬事件同步以及等价的前端归一化同步点。我们证明屏障充分性可在$O(|E|^2)$时间内判定,且验证器在建模语义下具有完备性与可靠性。我们实现了AccelSync——一款仅需修改硬件模型即可适配Ascend 910B2和寒武纪MLU370的静态验证工具。在来自CANN算子库的6,292个生产级内核上,AccelSync发现了3个此前未知的同步隐患——其中一个与我们在Ascend 910B2上某特定工具链/驱动配置(CANN 8.0.RC3)下观测到非确定性输出的隐患类别相匹配(尽管该现象在后继驱动升级后未复现)——并在120个LLM生成的内核中标记出19.2%的缺陷率(95%置信区间:[13.0%,27.4%])。针对688个非等价变异体的变异测试实现了100%检测率,而头对头比较表明,AccelSync能够检测出华为运行时消毒器msSanitizer遗漏的隐患,且每个内核的检测成本降低了400倍。