Dynamically scheduled high-level synthesis (HLS) enables the use of load-store queues (LSQs), which can disambiguate data hazards at circuit runtime, increasing throughput in codes with unpredictable memory accesses. However, the increased throughput comes at the price of a lower clock frequency and higher resource usage compared to statically scheduled circuits without LSQs. The lower frequency often nullifies any throughput improvement over static scheduling, while the resource usage becomes prohibitively expensive with large queue sizes. This paper presents a method for achieving dynamically scheduled memory operations in HLS without a significant increase in clock period or resource usage. We present a novel LSQ based on shift registers, enabled by the opportunity to specialize queue sizes to a target code in HLS. We show a method to speculatively allocate addresses to our LSQ, significantly increasing pipeline parallelism in codes that could not benefit from an LSQ before. In stark contrast to traditional load value speculation, we do not require pipeline replays and have no overhead on misspeculation. On a set of benchmarks with data hazards, our approach achieves an average speedup of 11$\times$ against static HLS and 5$\times$ against dynamic HLS that uses a state-of-the-art LSQ from previous work. Our LSQ also uses several times fewer resources, scales to queues with hundreds of entries, and supports both on-chip and off-chip memory.
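As an illustration of the kind of code with unpredictable memory accesses referred to above, consider the following minimal sketch. It is a hypothetical kernel chosen for illustration, not one of the benchmarks evaluated in this work; the names `histogram`, `idx`, `weight`, and `hist` are assumptions. Whether consecutive iterations touch the same address depends on the runtime contents of `idx`, so a static schedule must conservatively assume a read-after-write dependence on every iteration, whereas an LSQ can resolve the hazard at circuit runtime.

```c
#include <stddef.h>

/* Hypothetical example kernel (an assumption, not taken from the paper):
 * a weighted histogram whose store addresses are data dependent. */
void histogram(const int *idx, const int *weight, int *hist, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        /* The load and the store both target hist[idx[i]]. Iteration i+1
         * reads the location written by iteration i only when
         * idx[i+1] == idx[i], which is unknown until runtime. */
        hist[idx[i]] += weight[i];
    }
}
```

When most entries of `idx` differ from their neighbours, the iterations are independent in practice and can overlap in the pipeline; only the occasional matching addresses need to be serialized.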