Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer-based NLP models strain embedded platforms because their compute and memory access patterns diverge and hard real-time targets leave little slack. TRINE is a single-bitstream FPGA accelerator and compiler that executes end-to-end multimodal inference without reconfiguration. Layers are unified as DDMM/SDDMM/SpMM and mapped to a mode-switchable engine that toggles at runtime among weight/output-stationary systolic, 1xCS SIMD, and routable adder tree (RADT) modes on a shared PE array. A width-matched, two-stage top-k unit enables in-stream token pruning, while dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units to sustain utilization. Evaluated on Alveo U50 and ZCU104, TRINE reduces latency by up to 22.57x vs. RTX 4090 and 6.86x vs. Jetson Orin Nano at 20-21 W; token pruning alone yields a speedup of up to 7.8x on ViT-heavy pipelines, and DALO contributes up to a 79% throughput improvement. With int8 quantization, accuracy loss stays below 2.5% across representative tasks, delivering state-of-the-art latency and energy efficiency for unified vision, language, and graph workloads, all in one bitstream.
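To make the DDMM/SDDMM/SpMM unification concrete, the sketch below gives plain reference versions of the three kernel classes, reading the acronyms in their usual sense (dense-dense, sampled dense-dense, and sparse-dense matrix multiplication). The function names, the nested-list and dict data layouts, and the `rows` parameter are assumptions made for illustration; the abstract does not specify how TRINE represents operands or which layers map to which kernel.

```python
# Illustrative reference versions of the three kernel classes named in the
# abstract. Pure Python, written for clarity rather than speed; the data
# layouts here are assumptions, not TRINE's on-chip formats.

def ddmm(a, b):
    """Dense x dense: C = A @ B, with A (m x k) and B (k x n) as nested lists."""
    k, n = len(b), len(b[0])
    return [[sum(a_row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for a_row in a]

def sddmm(mask, a, b):
    """Sampled dense-dense: evaluate (A @ B)[i][j] only at positions in mask.

    mask is a set of (i, j) pairs; the result is a sparse {(i, j): value} dict.
    """
    k = len(b)
    return {(i, j): sum(a[i][p] * b[p][j] for p in range(k)) for (i, j) in mask}

def spmm(s, b, rows):
    """Sparse x dense: C = S @ B, with S given as a {(i, p): value} dict."""
    n = len(b[0])
    out = [[0] * n for _ in range(rows)]
    for (i, p), v in s.items():
        for j in range(n):
            out[i][j] += v * b[p][j]
    return out

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
dense = ddmm(a, b)                      # full product
sampled = sddmm({(0, 0), (1, 1)}, a, b)  # only the diagonal entries
sparse = spmm({(0, 1): 2, (1, 0): 1}, b, rows=2)
```

All three share the same multiply-accumulate inner loop and differ only in which operand or output positions are skipped, which is one plausible reason a single PE array can serve them by switching modes.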
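The two-stage top-k selection used for token pruning can be sketched in software as follows. This is a minimal illustration of the general technique (a local top-k per fixed-width chunk, then a merge over the surviving candidates), not the hardware unit itself; `two_stage_topk`, `prune_tokens`, and `lane_width` are hypothetical names, and treating "width-matched" as a chunk size equal to the datapath width is an assumption.

```python
import heapq

def two_stage_topk(scores, k, lane_width):
    """Return the indices of the k largest scores, highest score first."""
    # Stage 1: each chunk of lane_width scores keeps its own local top-k.
    candidates = []
    for start in range(0, len(scores), lane_width):
        chunk = range(start, min(start + lane_width, len(scores)))
        candidates.extend(heapq.nlargest(k, chunk, key=scores.__getitem__))
    # Stage 2: the global top-k is always contained in the union of the
    # local top-k sets, so selecting over the candidates is sufficient.
    return heapq.nlargest(k, candidates, key=scores.__getitem__)

def prune_tokens(tokens, scores, k, lane_width=4):
    """Keep the k highest-scoring tokens, preserving their original order."""
    keep = sorted(two_stage_topk(scores, k, lane_width))
    return [tokens[i] for i in keep]

scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7, 0.2, 0.6]
top = two_stage_topk(scores, k=3, lane_width=4)
kept = prune_tokens(list("abcdefgh"), scores, k=3)
```

Splitting the selection this way bounds the work of the second stage by the number of chunks times k rather than by the full sequence length, which is what makes it attractive for a streaming datapath.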