Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors

Modern RISC-V vector processors rely on the synergy of multi-lane parallelism and chaining to achieve high sustained throughput, yet their achieved performance often falls substantially short of the theoretical performance bound because of microarchitectural inefficiencies. In this work, we take the open-source RVV processor Ara as the target platform, analyze the sources of its sustained-throughput loss, and optimize the design accordingly. We first establish an ideal multi-lane chaining execution model as a microarchitectural reference for the steady-state progression of the vector backend. Based on this model, we attribute Ara's key bottlenecks to inefficiencies along three critical execution paths: memory-side inefficiencies in data supply and transaction issuance, control-side inefficiencies caused by conservative dependence management and issue control, and operand-delivery inefficiencies caused by access conflicts and result-propagation overhead. To address these bottlenecks, we propose a coordinated set of microarchitectural optimizations. Experimental results show that, without increasing raw memory bandwidth or changing the main processor configuration, the optimized design, Ara-Opt, achieves a geometric-mean speedup of 1.33x over baseline Ara. Under roofline-based normalization, the geometric-mean gap-closed ratio reaches 12.2%. In particular, scal, axpy, ger, and gemm achieve speedups of approximately 2.41x, 1.60x, 1.52x, and 1.42x, with corresponding gap-closed ratios of 93.7%, 88.9%, 78.3%, and 59.3%, respectively. These results show that the proposed optimizations effectively recover sustained-throughput capability lost to microarchitectural inefficiencies in Ara under essentially unchanged hardware resource constraints, and move the implementation points of regular streaming and high-throughput workloads significantly closer to the theoretical performance bound.
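The two summary metrics used above can be illustrated with a short sketch. The abstract does not define the gap-closed ratio, so the definition below (the fraction of the baseline-to-roofline gap that the optimized design recovers) is an assumption, and the function names are hypothetical. The geometric mean computed at the end covers only the four kernels listed in the abstract, so it is not expected to match the 1.33x figure, which covers the full benchmark suite.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def gap_closed_ratio(baseline, optimized, bound):
    """ASSUMED definition: share of the baseline-to-bound gap that is recovered.

    baseline, optimized: achieved performance (e.g. FLOP/cycle)
    bound: roofline bound for the same kernel, in the same unit
    """
    if bound <= baseline:
        raise ValueError("bound must exceed baseline performance")
    return (optimized - baseline) / (bound - baseline)


def implied_bound_over_baseline(speedup, gap_closed):
    """Invert the assumed definition: roofline bound relative to baseline."""
    return 1.0 + (speedup - 1.0) / gap_closed


# (speedup, gap-closed ratio) pairs as reported in the abstract.
reported = {
    "scal": (2.41, 0.937),
    "axpy": (1.60, 0.889),
    "ger": (1.52, 0.783),
    "gemm": (1.42, 0.593),
}

# Geometric mean over these four kernels only (not the full suite).
subset_geomean = geometric_mean([s for s, _ in reported.values()])

for name, (speedup, closed) in reported.items():
    headroom = implied_bound_over_baseline(speedup, closed)
    print(f"{name}: bound is about {headroom:.2f}x baseline under the assumed definition")
print(f"geometric-mean speedup of the four listed kernels: {subset_geomean:.2f}x")
```

Under this assumed definition, a kernel with a high gap-closed ratio ends up close to its roofline bound, while a kernel whose baseline was already near the bound can show a small speedup and still close most of its gap.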
