Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Maqam, a vocal form, is a significant component of Kurdish music. Maqam singers train either through traditional face-to-face instruction or through self-study. Automatic Singing Assessment (ASA) uses machine learning (ML) to assess singing accuracy and can help learners improve their performance through error detection. Currently available ASA tools follow Western music rules: they expect every note to stay within its expected pitch range from start to finish. Such systems do not account for micro-intervals and pitch bends, so they flag Kurdish maqam singing as incorrect even when the singer performs according to traditional rules. Assessing Kurdish maqam requires recognizing performance errors within microtonal spaces, which lie beyond Western equal temperament. This research is the first attempt to address this gap. While many types of error occur during singing, we focus on pitch, rhythm, and modal stability errors in the context of Bayati-Kurd. We collected 50 songs from 13 vocalists (2-3 hours in total) and annotated 221 error spans (150 fine pitch, 46 rhythm, 25 modal drift). The data were segmented into 15,199 overlapping windows and converted to log-mel spectrograms. We developed a two-headed CNN-BiLSTM with an attention mechanism to decide whether a window contains an error and to classify the error as one of the three chosen types. Trained for up to 20 epochs, with early stopping at epoch 10, the model reached a validation macro-F1 of 0.468. On the full 50-song evaluation at a threshold of 0.750, recall was 39.4% and precision was 25.8%. Within detected windows, the type macro-F1 was 0.387, with per-type F1 scores of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift); modal drift recall was 8.0%. The stronger performance on the more common error types suggests that the approach is viable, while the poor modal-drift recall indicates that more data and class balancing are needed.
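
The reported scores can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are assumptions, and the detection F1 it prints is derived here from the reported precision and recall rather than stated by the authors. It shows that the unweighted mean of the three per-type F1 scores is consistent with the reported type macro-F1, and it computes each error type's share of the 221 annotated spans to make the class imbalance explicit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)


# Figures taken from the abstract (variable names are hypothetical).
detection_precision = 0.258
detection_recall = 0.394
type_f1 = {"fine_pitch": 0.492, "rhythm": 0.536, "modal_drift": 0.133}
span_counts = {"fine_pitch": 150, "rhythm": 46, "modal_drift": 25}

# Derived value: not reported in the abstract.
detection_f1 = f1_score(detection_precision, detection_recall)

print(f"type macro-F1: {macro_f1(type_f1):.3f}")          # 0.387
print(f"detection F1 (derived): {detection_f1:.3f}")      # 0.312

# Share of annotated spans per error type.
total_spans = sum(span_counts.values())
for name, count in span_counts.items():
    print(f"{name}: {count}/{total_spans} = {count / total_spans:.1%}")
```

Modal drift accounts for roughly 11% of the annotated spans (25 of 221), which is in line with the abstract's conclusion that this class needs more data and balancing.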
