Principal Component Analysis (PCA) is widely used for dimensionality reduction in hyperspectral imaging, genomics, and neuroscience. However, it suffers from computational bottlenecks in matrix multiplication and singular value decomposition (SVD). Prior PCA hardware accelerators either target only one of these stages, rely on High-Level Synthesis (HLS), which limits microarchitectural optimization, or use fixed-point datapaths with limited dataset scalability. There is therefore a need for a unified PCA accelerator suitable for datasets of any input dimension. Hence, the proposed work presents MANOJAVAM, a scalable PCA accelerator fabric that unifies matrix multiplication and SVD in a single architecture. MANOJAVAM(T,S) comprises S TPU-style T×T systolic arrays employing block streaming for high-throughput matrix multiplication. It further integrates a highly parallel Jacobi unit implementing the Jacobi method for SVD with pipelined CORDIC-based rotations. A two-tier cache hierarchy and mode-aware memory policies adapt to the distinct memory access patterns of covariance and rotation computation. For demonstration, MANOJAVAM(4,8) is realized on a Xilinx Artix-7 FPGA, achieving a frequency of 200 MHz at 1.271 W, and MANOJAVAM(16,32) on a Xilinx Virtex UltraScale+ FPGA, achieving a frequency of 434 MHz at 16.957 W. Benchmarking on real-world datasets shows that MANOJAVAM(16,32) achieves up to a 22.75× speedup in SVD latency and a 42.14× reduction in total energy consumption compared to a high-performance NVIDIA A6000 GPU. The architecture offers a unified, scalable, and energy-efficient platform for large-scale data analytics in both high-performance and edge-computing environments.
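The two stages the accelerator unifies can be sketched in software: a covariance computation (the matrix-multiplication stage mapped to the systolic arrays) followed by a cyclic Jacobi diagonalization built from plane rotations (the stage the Jacobi unit performs, with each rotation realized in hardware by pipelined CORDIC). The NumPy sketch below is an illustrative reference model only, not the accelerator's implementation; the function names (`jacobi_eig`, `pca`, `cordic_rotate`) and the sweep/iteration counts are assumptions for illustration.

```python
import numpy as np

def jacobi_eig(A, sweeps=10):
    """Cyclic Jacobi method: diagonalize symmetric A with plane rotations."""
    A = A.astype(float).copy()
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-12:
                    continue
                # Angle that zeroes the off-diagonal element A[p, q]
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J    # apply the rotation on both sides
                V = V @ J          # accumulate eigenvectors
    return np.diag(A), V           # eigenvalues, eigenvectors

def pca(X, k):
    """PCA via covariance (matmul stage) + Jacobi diagonalization (SVD stage)."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (len(X) - 1)   # covariance: the systolic-array workload
    w, V = jacobi_eig(C)           # the Jacobi-unit workload
    order = np.argsort(w)[::-1]
    return Xc @ V[:, order[:k]]    # project onto the top-k components

def cordic_rotate(x, y, theta, iters=16):
    """CORDIC rotation mode: rotate (x, y) by theta with shift-add steps,
    as a hardware rotation engine would (here in floating point)."""
    angles = np.arctan2(1.0, 2.0 ** np.arange(iters))              # arctan(2^-i)
    K = np.prod(1.0 / np.sqrt(1.0 + 2.0 ** (-2.0 * np.arange(iters))))  # gain
    for i, a in enumerate(angles):
        d = 1.0 if theta >= 0 else -1.0
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        theta -= d * a
    return K * x, K * y
```

In the fabric, each `c, s` rotation above would be carried out by the CORDIC pipeline rather than by explicit sine/cosine evaluation; this sketch only captures the numerical structure of the computation.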