People have long hoped for a conversational system that can assist in real-life situations, and recent progress on large language models (LLMs) is bringing this idea closer to reality. While LLMs often perform impressively, their efficacy in real-world scenarios that demand expert knowledge remains unclear. LLMs are believed to hold the most potential and value in education, especially in the development of artificial intelligence (AI)-based virtual teachers capable of facilitating language learning. We focus on evaluating the efficacy of LLMs in education, specifically in spoken language learning, which encompasses phonetics, phonology, and second language acquisition. We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in these scenarios, covering both the understanding and the application of spoken language knowledge. In addition, we investigate the influence of various prompting techniques: zero- and few-shot methods (prepending the question with question-answer exemplars), chain-of-thought (CoT, thinking step by step), in-domain exemplars, and external tools (Google, Wikipedia). We conducted a large-scale evaluation of popular LLMs (20 distinct models) using these methods. We achieved significant performance improvements over the zero-shot baseline in reasoning on practical questions (GPT-3.5, 49.1% to 63.1%; LLaMA2-70B-Chat, 42.2% to 48.6%). We found that models of different sizes show a good understanding of concepts in phonetics, phonology, and second language acquisition, but are limited in reasoning about real-world problems. We also report preliminary findings on conversational communication.
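The prompting setups named in the abstract (zero-shot, few-shot with prepended question-answer exemplars, and chain-of-thought) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `MCQ` schema, the `build_prompt` helper, the exact prompt wording, and the two sample phonetics questions are hypothetical and are not taken from the paper's dataset or evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MCQ:
    """A multiple-choice question (hypothetical schema, not the dataset's format)."""
    question: str
    options: Sequence[str]
    answer: str = ""  # gold option letter, e.g. "B"; left empty for the target question


def format_question(q: MCQ) -> str:
    """Render the question stem followed by lettered options, one per line."""
    letters = "ABCDEFGH"
    lines = [f"Question: {q.question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(q.options)]
    return "\n".join(lines)


def build_prompt(target: MCQ, exemplars: Sequence[MCQ] = (), cot: bool = False) -> str:
    """Build a prompt string.

    - Zero-shot: no exemplars, the prompt ends with "Answer:".
    - Few-shot: solved exemplars are prepended before the target question.
    - CoT: the answer cue is replaced with a step-by-step trigger phrase
      (the phrase itself is an assumed wording).
    """
    blocks = [f"{format_question(ex)}\nAnswer: {ex.answer}" for ex in exemplars]
    cue = "Answer: Let's think step by step." if cot else "Answer:"
    blocks.append(f"{format_question(target)}\n{cue}")
    return "\n\n".join(blocks)


# Illustrative questions written for this sketch only.
exemplar = MCQ(
    question="Which of the following is a voiced bilabial stop?",
    options=["/p/", "/b/", "/t/", "/d/"],
    answer="B",
)
target = MCQ(
    question="Which of the following is a voiceless alveolar fricative?",
    options=["/z/", "/s/", "/f/", "/v/"],
)

zero_shot = build_prompt(target)
few_shot = build_prompt(target, exemplars=[exemplar])
cot_prompt = build_prompt(target, cot=True)
print(few_shot)
```

The resulting strings would be sent to whichever model is under evaluation. In-domain exemplars correspond to choosing `exemplars` from the same subject area as the target question, and tool-augmented prompting would additionally prepend retrieved text, which this sketch does not cover.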