In this work, we apply a first-order Taylor expansion to approximate a generic function $F: \mathbb{R}^{n \times m} \to \mathbb{R}^{n \times m}$ and use it for language modeling. To strengthen the basic Taylor expansion, we introduce iteration and piecewise modeling, and accordingly name the algorithm the Iterative Piecewise Affine (IPA) approximation. The resulting algorithm bears interesting resemblances to the Transformer decoder architecture. Comparing parameter arrangements in IPA and Transformers, we observe strikingly similar performance, with IPA outperforming Transformers by 1.5\% on next-token prediction with cross-entropy loss at smaller sequence lengths.
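To make the underlying idea concrete, the following is a minimal sketch of a first-order Taylor (affine) approximation, $F(X) \approx F(X_0) + J_F(X_0)\,(X - X_0)$, and of a piecewise variant that switches between several expansion points. It is not the IPA algorithm itself: the iteration step is not shown, and everything here is an assumption made for illustration, including the names `toy_f`, `jacobian_fd`, `affine_approx`, and `piecewise_affine`, the finite-difference Jacobian, the nearest-anchor rule for choosing a piece, and the treatment of an $n \times m$ input as a flattened list.

```python
import math


def toy_f(x):
    """Hypothetical smooth map standing in for F, applied to a flattened input."""
    return [math.tanh(v) for v in x]


def jacobian_fd(f, x0, eps=1e-6):
    """Estimate the Jacobian of f at x0 with central finite differences."""
    y0 = f(x0)
    jac = [[0.0] * len(x0) for _ in y0]
    for j in range(len(x0)):
        x_plus, x_minus = list(x0), list(x0)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(len(y0)):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac


def affine_approx(f, x0):
    """Return the affine map x -> f(x0) + J(x0) (x - x0)."""
    y0 = f(x0)
    jac = jacobian_fd(f, x0)

    def approx(x):
        dx = [a - b for a, b in zip(x, x0)]
        return [y + sum(r * d for r, d in zip(row, dx)) for y, row in zip(y0, jac)]

    return approx


def piecewise_affine(f, anchors):
    """Build one affine piece per anchor; evaluate with the nearest anchor's piece."""
    pieces = [(a, affine_approx(f, a)) for a in anchors]

    def approx(x):
        # Assumed selection rule: squared Euclidean distance to each anchor.
        _, piece = min(
            pieces, key=lambda p: sum((xi - ai) ** 2 for xi, ai in zip(x, p[0]))
        )
        return piece(x)

    return approx
```

Calling `affine_approx(toy_f, x0)` returns a function that agrees with `toy_f` at `x0` and degrades as the input moves away from it; `piecewise_affine(toy_f, [x0, x1])` reduces that error for inputs that lie closer to `x1`, which is the motivation for piecewise modeling in this sketch.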