We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, "Ganit"), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world's most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.
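The difficulty tags described above can be sketched with the standard unbiased pass@k estimator (Chen et al., 2021) applied to an evaluator model's samples. This is a minimal illustration, not the paper's implementation; the bucket thresholds and tag names are assumptions.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    draws (without replacement) from n evaluator samples is correct,
    given c of the n samples were correct."""
    if n - c < k:  # too few failures to fill k draws -> success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def difficulty_tag(p: float) -> str:
    """Bucket a pass@k score into a difficulty tag.
    Thresholds here are illustrative assumptions, not the paper's."""
    if p >= 0.8:
        return "easy"
    if p >= 0.4:
        return "medium"
    return "hard"
```

A curriculum sampler can then weight "medium" and "hard" problems more heavily as GRPO training progresses, which is one way to realize the difficulty-aware sampling the abstract describes.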
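The verifiable reward combining format, numerical correctness, and Bengali reasoning could look roughly like the following. This is a hedged sketch under assumptions: the `<answer>` tag format, the component weights, and exact-match answer checking are all illustrative choices, not the paper's specification.

```python
import re

# Unicode code-point range of the Bengali script block (assumption:
# script membership is used as the proxy for "Bengali reasoning tokens")
BENGALI_LO, BENGALI_HI = 0x0980, 0x09FF

def bengali_fraction(text: str) -> float:
    """Fraction of alphabetic characters that belong to the Bengali block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    bn = sum(1 for c in letters if BENGALI_LO <= ord(c) <= BENGALI_HI)
    return bn / len(letters)

def reward(completion: str, gold_answer: str) -> float:
    """Toy verifiable reward: format + numeric correctness + Bengali reasoning.
    Weights (0.2 / 0.6 / 0.2) are illustrative assumptions."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    r_format = 1.0 if m else 0.0
    pred = m.group(1).strip() if m else ""
    r_correct = 1.0 if pred == gold_answer.strip() else 0.0
    # Score the language of the reasoning only, excluding the answer tag
    reasoning = re.sub(r"<answer>.*?</answer>", "", completion, flags=re.S)
    r_lang = bengali_fraction(reasoning)
    return 0.2 * r_format + 0.6 * r_correct + 0.2 * r_lang
```

Because every component is checkable without a learned judge, the reward stays verifiable, which is what makes it usable as a GRPO signal in a low-resource setting where reward models are scarce.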