Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for first- and second-year beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques can adapt LLM outputs to better support beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails, controllable generation techniques can successfully improve output comprehensibility for beginner learners (from 39.4% to 83.3%). We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), which quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
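The Token Miss Rate described above can be sketched as a simple ratio. The following is a minimal illustration, assuming a set-based "known vocabulary" stands in for per-learner comprehensibility judgments; the paper's actual tokenization and annotation scheme may differ, and the vocabulary and utterance below are hypothetical examples:

```python
def token_miss_rate(tokens, known_vocab):
    """Token Miss Rate (TMR): fraction of tokens in an utterance
    that fall outside the learner's known vocabulary.
    Simplified sketch; real comprehensibility judgments come
    from learner annotations, not a fixed vocabulary set."""
    if not tokens:
        return 0.0
    missed = sum(1 for t in tokens if t not in known_vocab)
    return missed / len(tokens)

# Hypothetical A1-level vocabulary and a tokenized utterance
known = {"わたし", "は", "がくせい", "です"}
utterance = ["わたし", "は", "けいざいがく", "を", "せんこう",
             "し", "て", "いる", "がくせい", "です"]
print(token_miss_rate(utterance, known))  # 0.6 (6 of 10 tokens missed)
```

A lower TMR indicates an utterance closer to the learner's level; per the abstract, this token-level signal tracks human comprehensibility judgments.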