Recent advances in soft GPGPU architectures have shown that a small (<10K LUT), high-performance (770 MHz) processor is possible in modern FPGAs. In this paper we architect and evaluate banked memories for soft SIMT processors, which can support high bandwidth (up to 16 ports) while maintaining high speed (over 770 MHz). We compare 9 different memory architectures, including simpler multi-port memories, and run a total of 51 benchmarks (different combinations of algorithms, data sizes, and processor memories) to develop a comprehensive dataset that will guide the reader in making an informed memory architecture decision for their application. Our benchmarks consist of matrix transpositions (memory intensive) and FFTs (split between memory accesses, floating-point computations, and integer computations) to provide a balanced evaluation. We show that the simpler (but more memory-block-intensive) multi-port memories offer higher performance than the more architecturally complex banked memories for many applications, especially for smaller memories, but that the effective footprint cost of the multi-port memories quickly becomes prohibitive as dataset sizes increase. Our banked memory implementation results (high bandwidth, high Fmax, and high density) can also be used for other FPGA applications, such as high-level synthesis (HLS).
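
To illustrate why a matrix transposition stresses a banked memory, the following minimal sketch counts the cycles needed to serve one 16-lane SIMT access. It assumes low-order address interleaving (bank = address mod number of banks) and one access per bank per cycle; the abstract does not specify the bank mapping, so this mapping and the helper names `access_cycles`, `row_addresses`, and `column_addresses` are illustrative assumptions, not the architecture evaluated in the paper.

```python
from collections import Counter


def access_cycles(addresses, num_banks=16):
    """Cycles to serve one SIMT access, assuming bank = address % num_banks
    and one access per bank per cycle (both are assumptions of this sketch)."""
    per_bank = Counter(addr % num_banks for addr in addresses)
    # The most heavily loaded bank determines how long the access takes.
    return max(per_bank.values())


def row_addresses(row, pitch, lanes=16):
    """Each lane touches one element of a row (unit stride)."""
    return [row * pitch + lane for lane in range(lanes)]


def column_addresses(col, pitch, lanes=16):
    """Each lane touches one element of a column (stride = row pitch)."""
    return [lane * pitch + col for lane in range(lanes)]


if __name__ == "__main__":
    # 16x16 matrix stored with a row pitch of 16 words.
    print(access_cycles(row_addresses(0, pitch=16)))     # unit stride: conflict-free
    print(access_cycles(column_addresses(0, pitch=16)))  # stride 16: all lanes hit one bank
    # Padding each row by one word spreads a column across all banks.
    print(access_cycles(column_addresses(0, pitch=17)))
```

Under these assumptions the row access completes in 1 cycle, the column access with a pitch of 16 serializes into 16 cycles, and padding the pitch to 17 restores a single-cycle access. A true multi-port memory has no such conflicts, which is consistent with the performance advantage reported above, at the cost of more memory blocks.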