On Algorithmic Cache Optimization

We study the multiplication of two matrices $A$ and $B$, each of size $n \times n$, which produces a matrix $C$ of size $n \times n$. Our goal is to compute $C$ as efficiently as possible given a cache: a limited, one-dimensional set of data values on which elementary operations (additions, multiplications, etc.) can be performed. In other words, we try to maximize the reuse of data from $A$, $B$, and $C$ during the computation or, equivalently, to use data already in the fast-access cache as often as possible. First, we introduce the matrix-matrix multiplication algorithm. Second, we present a standard two-memory model to simulate the architecture of a computer, and we explain the LRU (Least Recently Used) cache policy, which is standard in most computers. Third, we introduce a basic cache simulator with $\mathcal{O}(M)$ time complexity, which restricts us to small values of $M$. We then discuss and model the LFU (Least Frequently Used) cache policy and the explicit-control cache policy. Finally, we introduce the main result of this paper, the $\mathcal{O}(1)$ cache simulator, and use it to compare experimentally the savings in time, energy, and communication obtained with the ideal cache-efficient algorithm for matrix-matrix multiplication. The cache simulator models the amount of data movement between the main memory and the cache of the computer. One of our findings is that, in some cases, the communication volume under an LRU cache policy differs significantly from that under explicit cache control. We propose to alleviate this problem by ``tricking'' the LRU policy: we update the timestamp of the data we want to keep in cache, namely the entries of matrix $C$. This gives us the benefits of an explicit cache policy while remaining within the LRU paradigm, which is the realistic policy on a CPU.
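To make the idea of counting data movement under LRU concrete, here is a minimal Python sketch. It is not the simulator described in this paper: the class `LRUCacheSim`, the function `matmul_misses`, and the choice of one matrix entry per cache slot are all illustrative assumptions, as is reading the `capacity` parameter as the cache size. The sketch uses an ordered dictionary so that each access costs constant time on average, and it counts how many entries must be loaded from main memory during the textbook triple-loop multiplication.

```python
from collections import OrderedDict


class LRUCacheSim:
    """Hypothetical LRU cache model that counts loads from main memory."""

    def __init__(self, capacity):
        self.capacity = capacity      # number of matrix entries the cache holds
        self.lines = OrderedDict()    # keys ordered from least to most recently used
        self.misses = 0               # entries fetched from main memory

    def access(self, key):
        if key in self.lines:
            # Hit: mark the entry as most recently used.
            self.lines.move_to_end(key)
        else:
            # Miss: load the entry, evicting the least recently used one if full.
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)
            self.lines[key] = True


def matmul_misses(n, capacity):
    """Count misses for the naive i-j-k multiplication of two n x n matrices."""
    cache = LRUCacheSim(capacity)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # C[i][j] += A[i][k] * B[k][j] touches three entries.
                cache.access(("A", i, k))
                cache.access(("B", k, j))
                cache.access(("C", i, j))
    return cache.misses


if __name__ == "__main__":
    for capacity in (1, 4, 12):
        print(capacity, matmul_misses(2, capacity))
```

When the cache can hold all $3n^2$ entries, each entry is loaded exactly once; with a single slot, every one of the $3n^3$ accesses is a miss. Intermediate capacities fall between these two extremes and depend on the loop order.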
