Domain-specific Field Programmable Gate Array (FPGA) architectures increasingly integrate specialized hardblocks, such as Tensor Slices, to accelerate artificial intelligence and machine learning workloads. Despite their efficiency benefits, these architectures remain difficult to program because designers typically rely on manual Register-Transfer Level (RTL) integration to access these hardblocks. This paper presents a compiler-agnostic methodology that enables high-level synthesis (HLS) tools to target custom FPGA hardblocks directly from C/C++ code. Architectural hardblocks are exposed as schedulable C-level operators using an RTL blackbox abstraction with explicit latency and initiation-interval contracts, allowing the HLS scheduler to optimize around specialized hardware without manual RTL orchestration. Unlike traditional uses of HLS blackboxes for external IP integration, our approach treats blackboxes as architectural abstractions, enabling scalable composition of C-level operators that target custom FPGA hardblocks without compiler modification. We evaluate the proposed flow using a Tensor Slice-based FPGA architecture with AMD Vitis HLS and the Verilog-to-Routing (VTR) toolchain. Across multiple matrix sizes, designs generated using the proposed C-Blackbox flow achieve lower area-delay product than behavioral HLS baselines while providing substantially higher productivity-adjusted efficiency than handwritten RTL implementations. These results demonstrate that domain-specific FPGA architectures can be made accessible through HLS while maintaining competitive hardware efficiency.
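As an illustration of the idea of exposing a hardblock as a schedulable C-level operator, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes a hypothetical 4x4 integer multiply-accumulate operator, `tensor_slice_mac`, whose C body serves only as a functional model; in an HLS flow such as Vitis HLS, the function would be bound to an RTL module through a blackbox description that declares its latency and initiation interval. The names `kTile`, `OperatorContract`, `tiled_matmul`, and `pipelined_cycles`, as well as the contract values (latency 8, II 1), are assumptions chosen for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical tile size handled by one operator call (assumption).
constexpr std::size_t kTile = 4;

// Timing contract a scheduler would rely on; values are illustrative only.
struct OperatorContract {
    int latency_cycles;       // cycles from input to output
    int initiation_interval;  // cycles between successive calls
};
constexpr OperatorContract kTensorSliceContract{8, 1};

using Tile = std::array<std::array<std::int32_t, kTile>, kTile>;

template <std::size_t N>
using Matrix = std::array<std::array<std::int32_t, N>, N>;

// C functional model of the hypothetical hardblock operator: acc += a * b.
// In a blackbox flow this body is used for C simulation only; the hardware
// behavior would come from the RTL bound to this function.
void tensor_slice_mac(const Tile& a, const Tile& b, Tile& acc) {
    for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j)
            for (std::size_t k = 0; k < kTile; ++k)
                acc[i][j] += a[i][k] * b[k][j];
}

// Composes the operator into an N x N matrix multiply by tiling.
template <std::size_t N>
Matrix<N> tiled_matmul(const Matrix<N>& a, const Matrix<N>& b) {
    static_assert(N % kTile == 0, "N must be a multiple of the tile size");
    Matrix<N> c{};
    for (std::size_t bi = 0; bi < N; bi += kTile) {
        for (std::size_t bj = 0; bj < N; bj += kTile) {
            Tile acc{};  // zero-initialized accumulator tile
            for (std::size_t bk = 0; bk < N; bk += kTile) {
                Tile ta{}, tb{};
                for (std::size_t i = 0; i < kTile; ++i)
                    for (std::size_t j = 0; j < kTile; ++j) {
                        ta[i][j] = a[bi + i][bk + j];
                        tb[i][j] = b[bk + i][bj + j];
                    }
                tensor_slice_mac(ta, tb, acc);  // one operator call per tile pair
            }
            for (std::size_t i = 0; i < kTile; ++i)
                for (std::size_t j = 0; j < kTile; ++j)
                    c[bi + i][bj + j] = acc[i][j];
        }
    }
    return c;
}

// Idealized cycle estimate for back-to-back pipelined calls under a contract:
// latency + (calls - 1) * II. Ignores data movement and dependencies.
long pipelined_cycles(long calls, OperatorContract contract) {
    assert(calls >= 0 && contract.initiation_interval >= 1);
    if (calls == 0) return 0;
    return contract.latency_cycles + (calls - 1) * contract.initiation_interval;
}
```

Under these assumptions, an 8x8 multiply issues eight operator calls. Because the contract states the latency and initiation interval explicitly, a scheduler can reason about overlapping those calls without inspecting the RTL behind the operator.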