The rapid adoption of large language models (LLMs) in AI-powered language education has created an urgent need for evaluations that assess pedagogical effectiveness, particularly in language learning, one of the most common LLM use cases (Tamkin et al. 2024, Costa-Gomes et al. 2025). Because the literature offers only narrowly defined, task-specific evaluations of AI system capabilities in second language (L2) education, more holistic approaches are needed in this AI-for-education space. To address this gap, we introduce L2-Bench, a novel evaluation benchmark grounded in a validated "language learning experience designer" construct to assess AI capabilities across L2 education contexts. Our methodology integrates pedagogical theory with sociotechnical AI evaluation methods and operationalizes a hierarchical taxonomy to structure an expert-curated dataset of over 1,000 authentic, rubric-scored task-response pairs, together with a measurement and scoring pipeline. We report the results of a pilot validation exercise (N = 39) on an initial sample of our dataset: tasks were validated as authentic (M = 4.23 out of 5), but criteria scores were lower (M = 3.94), with universally poor inter-annotator agreement despite good internal consistency. We also present the experimental design of our follow-up practitioner data validation study as we iterate and scale to the full dataset. Ultimately, this research not only offers methodological lessons toward a more context-specific AI evaluations ecosystem, but also works toward better-designed, reproducible evaluations of AI systems deployed in educational contexts.