KATANA: A Fast, Low-Power Mapping of Kalman Filters onto Edge NPUs for Real-Time Tracking

State estimation is the closed-loop core of every real-time tracking system, from radar surveillance and counter-UAV defense to autonomous driving and robotics. These systems are deployed on edge platforms: defense systems are mounted on vehicles and drones, and civilian pipelines run on cars and handheld devices. On such platforms, every additional watt of compute erodes mission duration or operational range. Two hard constraints follow: each new measurement must be fused before the next control cycle, and total compute must fit within a strict battery and thermal power envelope. The Linear and Extended Kalman Filters (LKF, EKF) are the dominant estimators on these systems, but today they execute almost exclusively on CPUs, which serialize multi-object tracking (MOT) updates, or on custom FPGA/ASIC accelerators, which lengthen design cycles. Contemporary AI-PC SoCs, such as the Intel Core Ultra Series 1 and 2, integrate a low-power, data-parallel Neural Processing Unit (NPU). We therefore ask whether the Kalman filter can be mapped onto this existing matrix engine to meet real-time and low-power budgets simultaneously, avoiding a dedicated accelerator and keeping the CPU and GPU free for primary workloads. We present KATANA, an NPU-aware optimization framework that delivers the first end-to-end mapping of the LKF and EKF onto a commercial NPU, alongside a cross-platform characterization on shipping AI-PC silicon. KATANA applies three algebraic graph rewrites: subtract-to-add reformulation via a precomputed negative-projection matrix H_neg, static-shape tensor fusion, and block-diagonal batched parallelization. Together, these rewrites ensure that 100% of operations execute on the DPU matrix engine. On the Series 2, the optimized batched EKF reaches 223.35 FPS at 13.43 W active power, and the LKF reaches 408.73 FPS at 14.05 W, a reduction in dynamic energy of up to 97.9% relative to the CPU implementation.
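
The subtract-to-add and block-diagonal rewrites named above can be illustrated with a minimal NumPy sketch. It is not the KATANA implementation: the constant-velocity model, the matrices F and H, the track count, and the helper block_diag are all assumptions chosen for illustration, and the code runs on the CPU rather than an NPU. It shows only that the innovation z - H x can be computed as z + H_neg x with H_neg = -H precomputed once, and that independent per-track filters can be stacked into one block-diagonal matrix product with a fixed shape.

```python
import numpy as np


def block_diag(blocks):
    """Place the given 2-D blocks along the diagonal of a zero matrix (illustrative helper)."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out


# Assumed toy model: state = [position, velocity], measurement = position only.
dt = 0.1
F = np.array([[1.0, dt],
              [0.0, 1.0]])      # state transition
H = np.array([[1.0, 0.0]])      # measurement projection
H_neg = -H                      # negative projection, precomputed once offline

n_tracks = 4                    # fixed track count keeps every tensor shape static
rng = np.random.default_rng(0)
x = rng.normal(size=(n_tracks, 2))   # one state vector per track
z = rng.normal(size=(n_tracks, 1))   # one measurement per track

# Reference: serial per-track loop using an explicit subtraction.
y_ref = np.stack([z[i] - H @ (F @ x[i]) for i in range(n_tracks)])

# Rewritten form: block-diagonal operators, only matrix multiplies and adds.
F_blk = block_diag([F] * n_tracks)           # shape (2N, 2N)
H_neg_blk = block_diag([H_neg] * n_tracks)   # shape (N, 2N)
x_pred = F_blk @ x.reshape(-1)               # all tracks predicted in one matmul
y_blk = z.reshape(-1) + H_neg_blk @ x_pred   # innovation without a subtract op

print(np.allclose(y_blk, y_ref.reshape(-1)))
```

The block-diagonal form spends multiplications on structural zeros, so it trades arithmetic efficiency for a single large, fixed-shape matrix product, which is the kind of workload a matrix engine is built to execute.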
