Real-time, energy-efficient inference on edge devices is essential for graph classification across a range of applications. Hyperdimensional Computing (HDC) is a brain-inspired computing paradigm that encodes input features into low-precision, high-dimensional vectors using simple element-wise operations, making it well suited to resource-constrained edge platforms. Recent work improves HDC accuracy for graph classification through Nyström kernel approximation. Accelerating such methods at the edge faces four challenges: (i) redundancy among the landmark samples selected by uniform sampling, (ii) storing the Nyström projection matrix within limited on-chip memory, (iii) expensive, contention-prone codebook lookups, and (iv) load imbalance caused by irregular sparsity in sparse matrix-vector multiplication (SpMV). To address these challenges, we propose HyperX, the first end-to-end FPGA accelerator for Nyström-based HDC graph classification at the edge. HyperX integrates four key optimizations: (i) a hybrid landmark selection strategy that combines uniform sampling with determinantal point processes (DPPs) to reduce redundancy and improve accuracy; (ii) a streaming architecture for the Nyström projection matrix that maximizes external memory bandwidth utilization; (iii) a minimal-perfect-hash lookup engine that provides $O(1)$ key-to-index mapping; and (iv) sparsity-aware SpMV engines with static load balancing. Implemented on an AMD Zynq UltraScale+ (ZCU104) FPGA, HyperX achieves $6.85\times$ ($4.32\times$) speedup and $169\times$ ($314\times$) higher energy efficiency than optimized CPU (GPU) baselines. It also improves classification accuracy by $3.4\%$ on average across the TUDataset benchmarks, a widely used standard suite for graph classification.
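Optimization (i) can be illustrated with a minimal sketch, though it is not HyperX's actual method. For illustration only, the sketch assumes that "hybrid" means two steps. First, a candidate pool is drawn by uniform sampling. Then the final landmarks are chosen from that pool by greedy MAP inference under a DPP, using incremental Cholesky updates. The DPP kernel is assumed to be an RBF similarity over per-graph feature vectors. The names `rbf`, `greedy_dpp`, and `hybrid_landmarks` and the parameters `gamma`, `pool_size`, and `eps` are all hypothetical.

```python
import math
import random


def rbf(x, y, gamma):
    """RBF similarity, used here as an ASSUMED DPP kernel (hypothetical choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def greedy_dpp(features, k, gamma=1.0, eps=1e-10):
    """Greedily pick up to k diverse items (approximate DPP MAP) via incremental Cholesky."""
    n = len(features)
    d2 = [rbf(f, f, gamma) for f in features]  # residual variance of each candidate
    c = [[] for _ in range(n)]                  # partial Cholesky rows built so far
    selected, chosen = [], set()
    for _ in range(min(k, n)):
        # The candidate with the largest residual gives the largest gain in log det(L_S).
        j = max((i for i in range(n) if i not in chosen), key=lambda i: d2[i])
        if d2[j] < eps:
            break  # every remaining candidate is (numerically) redundant
        selected.append(j)
        chosen.add(j)
        dj = math.sqrt(d2[j])
        for i in range(n):
            if i in chosen:
                continue
            dot = sum(a * b for a, b in zip(c[j], c[i]))
            e = (rbf(features[j], features[i], gamma) - dot) / dj
            c[i].append(e)
            d2[i] -= e * e  # remove the part of candidate i already explained by j
    return selected


def hybrid_landmarks(features, k, pool_size, gamma=1.0, seed=0):
    """Hypothetical hybrid scheme: uniform pre-sampling, then DPP-style selection."""
    rng = random.Random(seed)
    pool = rng.sample(range(len(features)), min(pool_size, len(features)))
    picked = greedy_dpp([features[i] for i in pool], k, gamma)
    return [pool[p] for p in picked]  # map pool positions back to sample indices
```

When a candidate is selected, the residual `d2` of its near-duplicates drops to roughly zero, so the greedy step never picks them. This is the redundancy-reduction effect the hybrid strategy targets. Uniform pre-sampling keeps the quadratic DPP step cheap. The selected landmarks would then define the Nyström approximation, which this sketch does not show.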
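Static load balancing for the sparsity-aware SpMV engines in optimization (iv) can be sketched under the following assumptions:

- the sparse matrix is stored in CSR format;
- its sparsity pattern is known before inference, so rows can be assigned offline;
- rows are spread over a hypothetical number of processing elements (`num_pes`) using the greedy longest-processing-time heuristic on per-row nonzero counts.

The names `balance_rows` and `spmv_partitioned` are illustrative. The sketch does not describe HyperX's actual hardware partitioning.

```python
import heapq


def balance_rows(row_ptr, num_pes):
    """Assign CSR rows to PEs so that nonzero counts are balanced (greedy LPT)."""
    num_rows = len(row_ptr) - 1
    nnz = [row_ptr[r + 1] - row_ptr[r] for r in range(num_rows)]
    heap = [(0, pe) for pe in range(num_pes)]  # (current nonzero load, PE id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_pes)]
    # Heaviest rows first; each goes to the currently least-loaded PE.
    for r in sorted(range(num_rows), key=lambda r: -nnz[r]):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(r)
        heapq.heappush(heap, (load + nnz[r], pe))
    return assignment


def spmv_partitioned(row_ptr, col_idx, values, x, assignment):
    """Compute y = A x, with each (simulated) PE processing only its own rows."""
    y = [0.0] * (len(row_ptr) - 1)
    for rows in assignment:  # on hardware, these loops would run in parallel
        for r in rows:
            acc = 0.0
            for k in range(row_ptr[r], row_ptr[r + 1]):
                acc += values[k] * x[col_idx[k]]
            y[r] = acc
    return y
```

The assignment depends only on `row_ptr`, so under this assumption it can be computed once offline. At run time each PE just walks its own row list, with no dynamic scheduling or work stealing. This is what makes the balancing "static".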