Sliding window sums are widely used in string indexing, hashing, and time series analysis. We have developed a family of generic vectorized sliding sum algorithms that achieve a speedup of $O(P/w)$ for window size $w$ and number of processors $P$. For a sum with a commutative operator, the speedup improves to $O(P/\log w)$. Even more importantly, our algorithms exhibit efficient memory access patterns. In this paper we study the application of sliding sum algorithms to the training and inference of deep neural networks. We demonstrate how both pooling and convolution primitives can be expressed as sliding sums and evaluated by compute kernels with a shared structure. We show that sliding sum convolution kernels are more efficient than the commonly used GEMM kernels on the CPU and could even outperform their GPU counterparts.
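To make the terminology concrete, the following is a minimal Python sketch of a sliding sum with a pluggable operator, together with stride-1 pooling and a one-dimensional "valid" convolution written in the same windowed style. It is an illustration only and is not the vectorized kernels described in the paper: the function names (`sliding_sum`, `sliding_sum_doubling`, `conv1d_valid`) are hypothetical, and the doubling routine is a generic textbook-style trick that is assumed here merely to show how the number of elementwise passes can grow with $\log w$ rather than $w$.

```python
from functools import reduce
import operator


def sliding_sum(xs, w, op):
    """Reference version: y[i] = xs[i] op xs[i+1] op ... op xs[i+w-1]."""
    return [reduce(op, xs[i:i + w]) for i in range(len(xs) - w + 1)]


def sliding_sum_doubling(xs, w, op):
    """Same result using O(log w) elementwise passes (assumes 1 <= w <= len(xs)).

    Each pass combines two whole lists element by element, which is the
    kind of operation that maps onto vector lanes. Only associativity of
    `op` is needed here, because operand order is preserved.
    """
    n = len(xs)
    acc, covered = None, 0        # acc[i] folds xs[i : i + covered]
    power, width = list(xs), 1    # power[i] folds xs[i : i + width]
    while True:
        if w & width:             # this power of two is part of w
            if acc is None:
                acc = power
            else:
                acc = [op(acc[i], power[i + covered])
                       for i in range(n - covered - width + 1)]
            covered += width
        if covered == w:
            return acc
        # Double the window width: combine each window with its right neighbour.
        power = [op(power[i], power[i + width])
                 for i in range(n - 2 * width + 1)]
        width *= 2


def conv1d_valid(xs, kernel):
    """Stride-1 'valid' correlation as a sum of shifted, scaled copies of xs."""
    n_out = len(xs) - len(kernel) + 1
    out = [0] * n_out
    for j, k in enumerate(kernel):
        out = [o + k * xs[i + j] for i, o in enumerate(out)]
    return out


if __name__ == "__main__":
    signal = [3, 1, 4, 1, 5, 9, 2, 6]
    print(sliding_sum(signal, 3, operator.add))        # window sums
    print(sliding_sum_doubling(signal, 3, max))        # stride-1 max pooling
    print(conv1d_valid(signal, [1, 0, -1]))            # simple edge filter
```

In this sketch, max pooling and sum (or average) pooling differ only in the operator passed in, while the convolution accumulates one shifted copy of the input per kernel tap; how the paper fuses these into a single kernel structure is not reproduced here.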