Idiom detection using Natural Language Processing (NLP) is the automated identification of figurative expressions in text whose meanings go beyond the literal interpretation of their words. While idiom detection has seen significant progress across many languages, Kurdish faces a considerable research gap in this area, despite the importance of idioms for tasks such as machine translation and sentiment analysis. This study addresses idiom detection in Sorani Kurdish by framing it as a text classification task and applying deep learning techniques. To this end, we developed a dataset of 10,580 sentences embedding 101 Sorani Kurdish idioms in diverse contexts. Using this dataset, we trained and evaluated three deep learning models: a KuBERT-based transformer for sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations showed that the fine-tuned BERT transformer consistently outperformed the others, achieving nearly 99% accuracy, compared with 96.5% for the RCNN and 80% for the BiLSTM. These results highlight the effectiveness of transformer-based architectures for low-resource languages such as Kurdish. This research contributes a dataset, three optimized models, and insights into idiom detection, laying a foundation for advancing Kurdish NLP.