Over two billion Apple devices ship with a Neural Processing Unit (NPU), the Apple Neural Engine (ANE), yet this accelerator remains largely unused for large language model workloads. CoreML, Apple's public ML framework, imposes opaque abstractions that prevent direct ANE programming and do not support on-device training. We present Orion, to our knowledge the first open end-to-end system to combine direct ANE execution, a compiler pipeline, and stable multi-step training with checkpoint resume in a single native runtime, bypassing CoreML entirely via Apple's private _ANEClient and _ANECompiler APIs. Building on prior characterization work by maderix, we extend public knowledge of ANE constraints to a catalog of 20 restrictions on MIL IR programs, memory layout, compilation limits, and numerical behavior, including 14 previously undocumented constraints discovered during Orion development. Orion comprises a compiler that lowers a graph IR through five optimization passes to ANE-native MIL, and a runtime that manages IOSurface-backed zero-copy tensor I/O, program caching, and delta compilation for weight updates. Because the ANE bakes weights in at compile time, naive training normally requires a full recompilation per step (~4.2 s). We show that compiled programs can instead be updated by unloading, patching weight files, and reloading, bypassing ANECCompile() and reducing per-step recompilation time from 4,200 ms to 494 ms (8.5x), which yields a 3.8x end-to-end training speedup. On an M4 Max, Orion achieves 170+ tokens/s for GPT-2 124M inference and demonstrates stable training of a 110M-parameter transformer on TinyStories for 1,000 steps in 22 minutes with zero NaN occurrences. We also present LoRA adapter-as-input, which enables hot-swapping adapters via IOSurface inputs without recompilation.