We present GanitLLM, a Bengali mathematical reasoning model (named after Ganit, the Bangla word for mathematics), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or fail outright on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by 8 and 6 accuracy points, respectively, while increasing the share of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words. The project page is available at https://dipta007.github.io/GanitLLM
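As a minimal sketch of two ingredients named in the abstract, the Python below buckets a problem by an evaluator's empirical pass rate over k samples and measures the share of Bengali words in a solution. The function names, the 0.75 and 0.25 thresholds, and the whitespace-based word split are illustrative assumptions and are not taken from the paper; only the Bengali Unicode block range (U+0980 to U+09FF) is a fixed fact.

```python
# Unicode range of the Bengali block.
BENGALI_START, BENGALI_END = 0x0980, 0x09FF


def difficulty_tag(num_correct: int, k: int) -> str:
    """Bucket a problem by the evaluator's pass rate over k sampled solutions.

    The thresholds are hypothetical, chosen only for illustration.
    """
    if k <= 0 or not 0 <= num_correct <= k:
        raise ValueError("need k > 0 and 0 <= num_correct <= k")
    rate = num_correct / k
    if rate >= 0.75:
        return "easy"
    if rate >= 0.25:
        return "medium"
    return "hard"


def is_bengali_word(word: str) -> bool:
    """True if the word contains at least one character from the Bengali block."""
    return any(BENGALI_START <= ord(ch) <= BENGALI_END for ch in word)


def bengali_fraction(text: str) -> float:
    """Fraction of whitespace-separated words that contain Bengali script."""
    words = text.split()
    if not words:
        return 0.0
    return sum(is_bengali_word(w) for w in words) / len(words)
```

A score like `bengali_fraction` could serve as one verifiable reward term alongside format and numerical-correctness checks, and tags like those from `difficulty_tag` could drive difficulty-aware sampling, though the actual reward and sampling definitions may differ.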