We devise, implement, and evaluate the performance of DYAD, a layer that can serve as a faster and more memory-efficient approximate replacement for linear layers (nn.Linear() in PyTorch). These layers appear in common subcomponents, such as the feed-forward (ff) module of Transformers. DYAD is based on a bespoke near-sparse matrix structure that approximates the dense "weight" matrix W, which matrix-multiplies the input in the typical realization of such a layer (referred to here as DENSE). Our alternative near-sparse matrix structure is decomposable into a sum of 2 matrices, each permutable to a block-sparse counterpart. These can be represented as 3D tensors, which in unison allow faster execution of the matrix multiplication with the mini-batched input matrix X compared to DENSE: O(rows(W) x cols(W)) --> O(rows(W) x cols(W) / # of blocks). As the crux of our experiments, we pretrain both DYAD and DENSE variants of 2 sizes of the OPT architecture and 1 size of the Pythia architecture, including at different token scales of the babyLM benchmark. We find DYAD to be competitive with DENSE, reaching >= 90% of its performance on zero-shot (e.g. BLIMP), few-shot (OPENLM) and finetuning (GLUE) benchmarks, while being >= 7-15% faster to train on-GPU even at 125m scale, with larger speedups surfacing at increasing scale and model width.
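To make the complexity claim concrete, the following is a minimal NumPy sketch of multiplying by a sum of two block-sparse matrices stored as 3D tensors. It is an illustration under stated assumptions, not the authors' implementation: the function names (`block_diag_matmul`, `stride_perm`, `dyad_like_matmul`, `dyad_like_dense`) are hypothetical, and the stride permutation used for the second component is an assumed choice, since the abstract does not specify which permutation DYAD uses.

```python
import numpy as np


def block_diag_matmul(blocks, x):
    """Multiply a block-diagonal matrix by x using only its nonzero blocks.

    blocks: (n_blocks, out_b, in_b) 3D tensor holding the diagonal blocks.
    x:      (n_blocks * in_b, batch) mini-batched input.
    Returns an array of shape (n_blocks * out_b, batch).
    """
    n_blocks, out_b, in_b = blocks.shape
    x3 = x.reshape(n_blocks, in_b, -1)   # split input rows into per-block groups
    y3 = np.matmul(blocks, x3)           # batched matmul: (n_blocks, out_b, batch)
    return y3.reshape(n_blocks * out_b, -1)


def stride_perm(n_blocks, block_size):
    """Assumed permutation: interleave indices across blocks (a stride shuffle)."""
    idx = np.arange(n_blocks * block_size)
    return idx.reshape(n_blocks, block_size).T.reshape(-1)


def dyad_like_matmul(blocks_a, blocks_b, x):
    """Compute (A + P_out^T B P_in) @ x without materializing the full matrix.

    A and B are block-diagonal and stored as 3D tensors; P_in and P_out are
    the (assumed) stride permutations of the input and output indices.
    """
    n_blocks, out_b, in_b = blocks_a.shape
    p_in = stride_perm(n_blocks, in_b)
    p_out = stride_perm(n_blocks, out_b)
    y_a = block_diag_matmul(blocks_a, x)
    z = block_diag_matmul(blocks_b, x[p_in])  # apply B to the permuted input
    y_b = np.empty_like(z)
    y_b[p_out] = z                            # undo the output permutation
    return y_a + y_b


def dyad_like_dense(blocks_a, blocks_b):
    """Build the equivalent dense matrix, for checking only."""
    n_blocks, out_b, in_b = blocks_a.shape
    a = np.zeros((n_blocks * out_b, n_blocks * in_b))
    b = np.zeros_like(a)
    for k in range(n_blocks):
        rows = slice(k * out_b, (k + 1) * out_b)
        cols = slice(k * in_b, (k + 1) * in_b)
        a[rows, cols] = blocks_a[k]
        b[rows, cols] = blocks_b[k]
    w2 = np.zeros_like(b)
    # Scatter B so that w2[p_out[i], p_in[j]] = b[i, j], i.e. w2 = P_out^T B P_in.
    w2[np.ix_(stride_perm(n_blocks, out_b), stride_perm(n_blocks, in_b))] = b
    return a + w2
```

In this sketch each of the two components performs rows(W) x cols(W) / n_blocks multiply-accumulates per input column, compared with rows(W) x cols(W) for the dense product, which matches the scaling stated above up to the constant factor of 2. Any wall-clock speedup depends on how well the batched matmul maps to the hardware.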