The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. To cope with this heterogeneity, we employ a simulator to estimate the performance of the matrix-matrix multiplication (GEMM) kernel on processors designed to operate at the edge. Our simulator follows the modern implementations of GEMM advocated by GotoBLAS2, BLIS, OpenBLAS, etc., and carefully accounts for the volume of data transferred across the memory hierarchy by different algorithmic variants of the kernel. A small collection of experiments provides the data needed to calibrate the simulator, which then delivers highly accurate estimates of the execution time on a given processor architecture.
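To make the idea of accounting for data transfers concrete, the following is a minimal sketch of how main-memory traffic might be counted for a GotoBLAS/BLIS-style blocked GEMM, C += A·B with C of size m×n, A of size m×k and B of size k×n. It is not the simulator described here: the function names, the restriction to main-memory traffic, and the additive time model with the two calibrated rates `flops_per_s` and `bytes_per_s` are all assumptions made for illustration. Transfers between cache levels, which depend on the remaining blocking parameters, are omitted.

```python
from math import ceil


def gemm_main_memory_traffic(m, n, k, nc, kc):
    """Count elements moved to/from main memory by a blocked GEMM C += A*B.

    Assumed loop order (outermost first): n in steps of nc, then k in
    steps of kc, with B and A packed into buffers that stay in cache.
    """
    n_jc = ceil(n / nc)  # iterations of the outermost loop (over n)
    n_pc = ceil(k / kc)  # iterations of the second loop (over k)
    return {
        "B_read": k * n,          # each entry of B is packed exactly once
        "A_read": m * k * n_jc,   # A is re-packed for every nc-wide panel of B
        "C_read": m * n * n_pc,   # C is loaded once per kc-deep update
        "C_write": m * n * n_pc,  # and stored back after that update
    }


def estimate_time(m, n, k, nc, kc, flops_per_s, bytes_per_s, elem_bytes=8):
    """Hypothetical time model: arithmetic time plus memory time, no overlap.

    flops_per_s and bytes_per_s stand in for rates that would be
    obtained from a few calibration experiments on the target processor.
    """
    traffic = gemm_main_memory_traffic(m, n, k, nc, kc)
    compute_s = 2.0 * m * n * k / flops_per_s
    memory_s = sum(traffic.values()) * elem_bytes / bytes_per_s
    return compute_s + memory_s
```

Under these assumptions, a larger `nc` reduces how often A is re-read and a larger `kc` reduces how often C is updated, which is the kind of trade-off between algorithmic variants that a transfer-counting model can expose.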