Access to high-quality preparation materials and feedback for the Romanian Bacalaureat exam is limited, particularly for students in remote or underserved areas. This paper presents BacPrep, an experimental online platform that explores the potential of Large Language Models (LLMs) for automated assessment, with the aim of offering a free, accessible resource. BacPrep draws on official exam questions from the last 5 years. During the data collection phase it uses the latest available Gemini Flash model (currently Gemini 2.5 Flash, via the \texttt{gemini-flash-latest} endpoint) to prioritize user experience; the model version will be locked for the subsequent rigorous evaluation. The platform has collected over 100 student solutions across the Computer Science and Romanian Language exams, enabling a preliminary assessment of LLM grading quality. This assessment revealed several significant challenges: grading inconsistency across multiple runs, arithmetic errors when aggregating fractional scores, performance degradation under large prompt contexts, failure to apply subject-specific rubric weightings, and internal inconsistencies between generated scores and qualitative feedback. These findings motivate a redesigned architecture featuring subject-level prompt decomposition, specialized per-subject graders, and a median-selection strategy across multiple runs. Expert validation against human-graded solutions remains the critical next step.
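The median-selection strategy and the fractional-score aggregation issue mentioned above can be illustrated with a minimal sketch. It is not BacPrep's implementation: the run structure (`item_scores`, `feedback`), the helper names, and the choice of the low median are assumptions made for illustration. Per-item scores are summed in code with exact decimal arithmetic rather than by the model, and one complete run is selected so that the reported score and its qualitative feedback come from the same output.

```python
from decimal import Decimal
from statistics import median_low


def total_score(item_scores):
    """Sum per-item scores exactly; scores are given as strings such as '0.5'."""
    return sum((Decimal(s) for s in item_scores), Decimal("0"))


def select_median_run(runs):
    """Return the run whose exact total is the low median across all runs.

    Hypothetical helper: each run is a dict with 'item_scores' and 'feedback'.
    Selecting a whole run (instead of averaging) keeps score and feedback paired.
    """
    totals = [total_score(run["item_scores"]) for run in runs]
    target = median_low(totals)  # always one of the observed totals
    return runs[totals.index(target)]


# Three hypothetical grading runs for the same student solution.
runs = [
    {"item_scores": ["1.5", "0.5", "2"], "feedback": "run A"},
    {"item_scores": ["1.5", "1", "2"], "feedback": "run B"},
    {"item_scores": ["2", "1.5", "2"], "feedback": "run C"},
]

chosen = select_median_run(runs)
print(chosen["feedback"], total_score(chosen["item_scores"]))
```

Using `median_low` rather than `median` is a design choice in this sketch: with an even number of runs it still returns a total that some run actually produced, so a matching run always exists.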