Vector processing has become commonplace in today's CPU microarchitectures. Vector instructions improve performance and energy efficiency, which is crucial for resource-constrained mobile devices. The research community currently lacks a comprehensive benchmark suite for studying the benefits of vector processing on mobile devices. This paper presents Swan, an extensive vector processing benchmark suite for mobile applications. Swan consists of a diverse set of data-parallel workloads from four commonly used mobile applications: an operating system, a web browser, an audio/video messaging application, and a PDF rendering engine. Using the Swan benchmark suite, we conduct a detailed analysis of the performance, power, and energy consumption of vectorized workloads, and show that: (a) Vectorized kernels increase the pressure on the cache hierarchy due to their higher rate of memory requests. (b) Vector processing is more beneficial for workloads with lower-precision operations and higher cache hit rates. (c) Limited instruction-level parallelism and strided memory accesses to multi-dimensional data structures prevent the benefits of vector processing from scaling with more SIMD functional units and wider registers. (d) Despite having lower computation throughput than domain-specific accelerators, such as GPUs, vector processing outperforms these accelerators for kernels with lower operation counts. Finally, we identify five common computation patterns in mobile data-parallel workloads that dominate the execution time.