Large Multimodal Models (LMMs) have demonstrated exceptional performance across a wide range of domains. This paper explores their potential in pronunciation assessment tasks, with a particular focus on evaluating the capabilities of the Generative Pre-trained Transformer (GPT) model, specifically GPT-4o. Our study investigates its ability to process speech and audio for pronunciation assessment across multiple levels of granularity and dimensions, with an emphasis on feedback generation and scoring. For our experiments, we use the publicly available Speechocean762 dataset. The evaluation focuses on two key aspects: multi-level scoring and the practicality of the generated feedback. Scoring results are compared against the manual scores provided in the Speechocean762 dataset, while feedback quality is assessed using Large Language Models (LLMs). The findings highlight the effectiveness of integrating LMMs with traditional methods for pronunciation assessment, offering insights into the model's strengths and identifying areas for further improvement.