Idiom detection using Natural Language Processing (NLP) is the automated identification of figurative expressions in text whose meanings go beyond the literal interpretation of their words. While idiom detection has seen significant progress across many languages, Kurdish faces a considerable research gap in this area, despite the importance of idioms for tasks such as machine translation and sentiment analysis. This study addresses idiom detection in Sorani Kurdish by framing it as a text classification task and applying deep learning techniques. To this end, we developed a dataset of 10,580 sentences embedding 101 Sorani Kurdish idioms in diverse contexts. Using this dataset, we trained and evaluated three deep learning models: a KuBERT-based transformer for sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations showed that the fine-tuned BERT transformer consistently outperformed the others, achieving nearly 99% accuracy, compared with 96.5% for the RCNN and 80% for the BiLSTM. These results highlight the effectiveness of transformer-based architectures for low-resource languages such as Kurdish. This research contributes a dataset, three optimized models, and insights into idiom detection, laying a foundation for advancing Kurdish NLP.