Transformer architectures based on the attention mechanism have revolutionized natural language processing (NLP), driving major breakthroughs across virtually every NLP task. However, their substantial memory and computational requirements still hinder deployment on ultra-constrained devices such as wearables and Internet-of-Things (IoT) units, where available memory is limited to just a few megabytes. To address this challenge, we introduce EmbBERT, a tiny language model (TLM) architecturally designed for extreme efficiency. The model integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism that together enable strong performance under strict memory budgets. Through this redesign for the extreme edge, we demonstrate that highly simplified transformer architectures remain remarkably effective under tight resource constraints. EmbBERT requires only 2 MB of total memory yet achieves accuracy comparable to that of state-of-the-art (SotA) models requiring a $\mathbf{10\times}$ larger memory budget. Extensive experiments on the curated TinyNLP benchmark and the GLUE suite confirm that EmbBERT attains accuracy competitive with larger SotA models and consistently outperforms downsized versions of BERT and MAMBA of similar size. Furthermore, we demonstrate the model's resilience to 8-bit quantization, which further reduces memory usage to just 781 kB, and the scalability of the EmbBERT architecture across the sub-megabyte to tens-of-megabytes range. Finally, an ablation study confirms the positive contribution of each component and of the pre-training procedure. All code, scripts, and checkpoints are publicly released to ensure reproducibility: https://github.com/RiccardoBravin/tiny-LLM.