EmbBERT: Attention Under 2 MB Memory

Transformer architectures based on the attention mechanism have revolutionized natural language processing (NLP), driving major breakthroughs across virtually every NLP task. However, their substantial memory and computational requirements still hinder deployment on ultra-constrained devices such as wearables and Internet-of-Things (IoT) units, where available memory is limited to just a few megabytes. To address this challenge, we introduce EmbBERT, a tiny language model (TLM) architecturally designed for extreme efficiency. The model integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism that together enable optimal performance under strict memory budgets. Through this redesign for the extreme edge, we demonstrate that highly simplified transformer architectures remain remarkably effective under tight resource constraints. EmbBERT requires only 2 MB of total memory and achieves accuracy comparable to that of state-of-the-art (SotA) models that require a $\mathbf{10\times}$ larger memory budget. Extensive experiments on the curated TinyNLP benchmark and the GLUE suite confirm that EmbBERT is competitive with these larger SotA models and consistently outperforms downsized versions of BERT and MAMBA of similar size. Furthermore, we demonstrate the model's resilience to 8-bit quantization, which further reduces memory usage to just 781 kB, and the scalability of the EmbBERT architecture across the sub-megabyte to tens-of-megabytes range. Finally, we perform an ablation study demonstrating the positive contributions of all components and of the pre-training procedure. All code, scripts, and checkpoints are publicly released to ensure reproducibility: https://github.com/RiccardoBravin/tiny-LLM.
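
As a rough illustration of why 8-bit quantization shrinks memory, the following is a minimal sketch of symmetric per-tensor int8 quantization applied to a toy weight vector. It is not taken from the EmbBERT code base: the function names `quantize_int8` and `dequantize`, the toy weights, and the choice of a single shared scale are all assumptions made for the example. It only shows weight storage and does not attempt to reproduce the 781 kB figure.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization (illustrative assumption, not EmbBERT's scheme).

    Maps each float to an integer in [-127, 127] using one shared scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate float32 values from the int8 codes."""
    return array("f", (v * scale for v in q))


# Toy weights stored as 32-bit floats (hypothetical values).
weights = array("f", [0.5, -1.0, 0.25, 0.0, 0.75, -0.125])

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(weights) * weights.itemsize  # 4 bytes per weight
int8_bytes = len(q) * q.itemsize              # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"float32 storage: {fp32_bytes} B, int8 storage: {int8_bytes} B")
print(f"largest reconstruction error: {max_error:.5f} (scale = {scale:.5f})")
```

Under these assumptions, weight storage drops by a factor of four while each reconstructed weight stays within half a quantization step of its original value. A model's total footprint also includes buffers that this sketch does not model.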
