This paper presents a benchmarking methodology for evaluating end-to-end performance of deterministic signal-processing pipelines expressed using CNN-compatible primitives. The benchmark targets phased-array workloads such as ultrasound imaging and evaluates complete RF-to-image pipelines under realistic execution conditions. Performance is reported using sustained input throughput (MB/s), effective frame rate (FPS), and, where available, incremental energy per run and peak memory usage. Using this methodology, we benchmark a single deterministic, training-free CNN-based signal-processing pipeline executed unmodified across heterogeneous accelerator platforms, including an NVIDIA RTX 5090 GPU and a Google TPU v5e-1. The results demonstrate how different operator formulations (dynamic indexing, fully CNN-expressed, and sparse-matrix-based) impact performance and portability across architectures. This work is motivated by the need for portable, certifiable signal-processing implementations that avoid hardware-specific refactoring while retaining high performance on modern AI accelerators.
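As an illustration of the two headline metrics, the following minimal sketch shows one way sustained input throughput (MB/s) and effective frame rate (FPS) could be derived from wall-clock timing of repeated end-to-end runs. It is not the benchmark's actual harness: the names `summarize` and `time_pipeline`, the warm-up count, and the convention 1 MB = 10^6 bytes are all assumptions made for this example.

```python
import time


def summarize(total_input_bytes, total_frames, elapsed_s):
    """Return (throughput in MB/s, frame rate in FPS) for a timed batch of runs.

    Assumes 1 MB = 1e6 bytes; the benchmark itself may use a different convention.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    throughput_mb_s = total_input_bytes / 1e6 / elapsed_s
    fps = total_frames / elapsed_s
    return throughput_mb_s, fps


def time_pipeline(pipeline, input_bytes_per_run, frames_per_run, n_runs=10, n_warmup=2):
    """Time `pipeline` (a hypothetical zero-argument callable) over n_runs runs.

    The callable is assumed to block until its output is ready. On an
    accelerator this normally requires an explicit synchronization inside
    the callable; otherwise only the kernel launch is timed.
    """
    # Warm-up runs absorb one-time costs such as compilation and allocation.
    for _ in range(n_warmup):
        pipeline()

    start = time.perf_counter()
    for _ in range(n_runs):
        pipeline()
    elapsed_s = time.perf_counter() - start

    return summarize(input_bytes_per_run * n_runs, frames_per_run * n_runs, elapsed_s)
```

For example, 200 MB of RF input producing 100 frames in 2 s corresponds to `summarize(200_000_000, 100, 2.0)`, that is, 100 MB/s and 50 FPS. Timing many runs after a warm-up phase, rather than a single run, is what makes the reported figures sustained rather than peak.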