Accurate phoneme recognition is pivotal for mispronunciation detection and diagnosis (MDD) in modern standard Arabic (MSA), yet remains constrained by data scarcity and the synthetic-real domain gap. This work proposes a two-stage end-to-end framework. It integrates a pre-trained encoder with causal dilated temporal convolutional networks to preserve fine-grained phonetic variations. A hierarchical two-stage strategy first learns general mappings from native/synthetic corpora, then adapts to scarce real learner data to mitigate domain shift without over-correction. Prediction stability is further enhanced via multi-checkpoint ensemble inference with N-gram rescoring. Evaluated on the QuranMB.v2 test set, our system achieves an F1-score of $0.7201$, a $63.1$\% relative improvement over baseline ($0.4414$). This performance ranks at the top of the IqraEval.2 Challenge, establishing a new state-of-the-art for low-resource MSA in MDD.
翻译:暂无翻译