While the NLP landscape is dominated by multi-billion-parameter architectures, deploying them for low-resource languages written in non-Latin scripts remains computationally prohibitive on edge devices, mobile systems, and decentralized local hardware. This paper presents bangla-smollm-135m, a compact 135-million-parameter decoder-only foundation model engineered explicitly for efficient language modeling in the Bangla script. By applying a deterministic intersect-and-append token merging strategy between TituLLMs and SmolLM2-135M, the model overcomes subword fragmentation of the script without destabilizing early pretrained parameter states. In zero-shot multi-task benchmark evaluations (PIQA_bn, OpenBookQA_bn, CommonsenseQA_bn, and Bangla_MMLU), bangla-smollm-135m matches or outperforms Gemma-3-270m, a model twice its size, and achieves parity with models in the 1B-parameter tier. The model is available at rnnandi/bangla-smollm-135m.
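The abstract names the intersect-and-append merge but does not spell it out. One plausible reading, offered here only as a minimal sketch and not as the authors' implementation, is that tokens shared by both vocabularies keep their base IDs (the intersect step), while tokens found only in the donor vocabulary receive new IDs after the last base ID, in donor order (the append step). The function name `merge_vocab`, the base/donor roles, and the toy vocabularies are all hypothetical.

```python
def merge_vocab(base_vocab: dict[str, int], donor_tokens: list[str]):
    """Hypothetical sketch of an intersect-and-append vocabulary merge.

    base_vocab:   token -> id mapping of the pretrained model (IDs are kept).
    donor_tokens: tokens of the second tokenizer, in a fixed order.
    Returns the merged mapping and the list of newly appended tokens.
    """
    merged = dict(base_vocab)  # existing IDs are never changed
    next_id = max(base_vocab.values()) + 1 if base_vocab else 0
    appended = []
    for token in donor_tokens:  # fixed iteration order keeps the merge deterministic
        if token in merged:
            continue  # intersect: shared token keeps its base ID
        merged[token] = next_id  # append: donor-only token gets a fresh ID
        appended.append(token)
        next_id += 1
    return merged, appended


# Toy vocabularies, invented for illustration only.
base = {"<s>": 0, "the": 1, "ing": 2}
donor = ["<s>", "বাংলা", "ing", "ভাষা"]
merged, appended = merge_vocab(base, donor)
```

Because the base IDs are untouched in this sketch, the embedding rows already learned for them stay valid, and only the rows for the appended tokens would need to be created and trained. How those new rows are initialized is not stated in the abstract and is left out here.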