Bitformer: An efficient Transformer with bitwise operation-based attention for Big Data Analytics at low-cost low-precision devices

The Transformer is a cornerstone of today's large models, but its attention mechanism is computationally expensive and relies on high-precision floating-point operations. Both properties are obstacles in computation-intensive scenarios such as edge computing, where devices are resource-constrained and favor lower precision. To meet the data processing demands of edge devices, we introduce Bitformer, an extension of the Transformer whose central component is a novel attention mechanism that replaces conventional floating-point matrix multiplication with bitwise operations. This substitution has two advantages: it preserves the attention mechanism's ability to capture long-range dependencies, and it reduces the computational complexity of the attention operation from $O(n^2d)$ for floating-point operations to $O(n^2T)$ for bitwise operations, where the parameter $T$ is markedly smaller than the conventional dimensionality parameter $d$. Bitformer thus aims to reconcile the requirements of modern computing with the constraints of edge computing, bridging the gap between high-performing models and resource-scarce environments and opening a promising direction for further work.
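As a rough illustration of how a bitwise operation can stand in for a floating-point dot product, the sketch below scores query-key pairs with XNOR and popcount over sign-binarized vectors. This is a generic technique from binarized networks, used here as an assumption; the abstract does not specify Bitformer's actual attention formula, nor does it define $T$, so the sketch does not attempt to reproduce either. All function names (`pack_signs`, `bitwise_score`, `sign_dot`, `score_matrix`) are hypothetical.

```python
# Illustrative only: XNOR + popcount similarity on sign-binarized vectors.
# This is NOT the Bitformer attention mechanism, whose details the abstract
# does not give; it only shows a dot product replaced by bitwise operations.

def pack_signs(vec):
    """Pack the sign pattern of a real vector into an int (bit i set if vec[i] >= 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits


def bitwise_score(q_bits, k_bits, d):
    """Dot product of two {-1, +1}^d vectors computed from their packed sign bits."""
    mask = (1 << d) - 1
    # XNOR marks positions where the signs agree; popcount counts them.
    agree = bin(~(q_bits ^ k_bits) & mask).count("1")
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * agree - d


def sign_dot(q, k):
    """Reference implementation: explicit dot product of the binarized vectors."""
    def sign(x):
        return 1 if x >= 0 else -1
    return sum(sign(a) * sign(b) for a, b in zip(q, k))


def score_matrix(queries, keys):
    """n x n matrix of bitwise scores; each entry uses one XOR and one popcount."""
    d = len(queries[0])
    q_bits = [pack_signs(q) for q in queries]
    k_bits = [pack_signs(k) for k in keys]
    return [[bitwise_score(a, b, d) for b in k_bits] for a in q_bits]


if __name__ == "__main__":
    queries = [[0.5, -1.0, 2.0, -0.1], [1.0, 1.0, 1.0, 1.0]]
    keys = [[1.0, -2.0, -3.0, 0.4], [-1.0, -1.0, -1.0, -1.0]]
    print(score_matrix(queries, keys))
```

The score matrix still has $n^2$ entries, so the quadratic dependence on sequence length remains; what changes in this sketch is the per-pair cost, which drops from $d$ floating-point multiply-adds to a few operations on packed bits.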
