Recent advances in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing work remains confined to Western tonal traditions and offers little insight into whether current LLMs can handle structurally distinct, low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga- and tala-based constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra Sangeet and Nazrul Sangeet, which are representative low-resource traditions within South Asian classical music. For music understanding, we introduce a benchmark of 504 question-answer pairs spanning raga grammar, cultural knowledge, and symbolic notation reasoning, and use it to evaluate 33 LLMs. Frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results show that structural validity and stylistic faithfulness are distinct objectives in music generation, and they highlight an open challenge for culturally grounded music modeling.