Bangla is among the world's most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, so written answers are graded largely by hand, which makes timely and consistent feedback difficult. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response given the question, the reference answer, and the student answer, producing a numeric score together with concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Among the proprietary and open-source LLMs we evaluate under a unified protocol, our QLoRA-tuned Qwen3-8B performs best on both fronts: it produces the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and achieves the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.