Rotary positional embedding has become the state-of-the-art approach for encoding position information in transformer-based models. Although it is often expressed succinctly in complex linear algebra, we note that the actual implementation of the $Q/K/V$ projections is not equivalent to a complex linear transformation. We argue that a complex linear transformation is a more natural parametrization and saves nearly 50\% of the parameters within the attention block. We show empirically that removing this redundancy has a negligible impact on model performance. Our modification yields more efficient parameter usage as well as a cleaner interpretation of the representation space.
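To make the parameter-count argument concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes a single square projection and a stacked $[\mathrm{Re}; \mathrm{Im}]$ layout of the hidden vector (rotary implementations typically pair coordinates differently); the helper name `complex_to_real_matrix` and the size `n` are hypothetical. It illustrates that a complex linear map on $n$ complex dimensions corresponds to a structured real matrix on $2n$ real dimensions, and that it needs half as many real parameters as an unconstrained real matrix of the same shape.

```python
import numpy as np


def complex_to_real_matrix(W: np.ndarray) -> np.ndarray:
    """Return the real (2n x 2n) matrix that acts on [Re(z); Im(z)]
    exactly as the complex (n x n) matrix W acts on z."""
    A, B = W.real, W.imag
    # (A + iB)(u + iv) = (Au - Bv) + i(Bu + Av)
    return np.block([[A, -B], [B, A]])


rng = np.random.default_rng(0)
n = 4  # hypothetical number of complex dimensions (2n real dimensions)

# A complex projection and a complex input vector.
W = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Apply the map in complex form and in its equivalent real form.
w = W @ z
M = complex_to_real_matrix(W)
y = M @ np.concatenate([z.real, z.imag])
matches = np.allclose(y, np.concatenate([w.real, w.imag]))

# Free real parameters: complex map vs. unconstrained real map.
complex_params = 2 * W.size  # real and imaginary parts: 2 * n^2
real_params = M.size         # unconstrained real matrix: (2n)^2 = 4 * n^2
ratio = complex_params / real_params
```

Under these assumptions, an unconstrained real projection has $(2n)^2$ free parameters, whereas the complex one has $2n^2$, which is where a saving of about one half for a single projection can come from. The exact saving within a full attention block depends on which projections are constrained.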