Intent classification is a fundamental task in natural language understanding: it assigns user queries or sentences to predefined classes in order to identify the user's intent. The main challenge of this task is building a dataset that covers the relevant intent classes while preserving adequate linguistic variation. Extensive research on this task exists for high-resource languages such as English. In this study, we introduce BNIntent30, a comprehensive Bengali intent classification dataset containing 30 intent classes. The dataset is excerpted and translated from the CLINC150 dataset, which covers a diverse range of user intents across 150 classes. To evaluate the proposed dataset, we further propose GAN-BnBERT, a novel approach to Bengali intent classification based on Generative Adversarial BERT. Our approach uses BERT-based contextual embeddings to capture salient linguistic features and contextual information from the text, while the generative adversarial network (GAN) component helps the model learn diverse representations of the existing intent classes through generative modeling. Our experimental results show that GAN-BnBERT outperforms both the existing Bi-LSTM model and the stand-alone BERT-based classification model on the newly introduced BNIntent30 dataset.
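To make the adversarial component concrete, the sketch below shows one common way a GAN-BERT-style objective can be written: the discriminator scores each input over the 30 intent classes plus one extra "generated" class. This is an illustration under stated assumptions, not the formulation confirmed by this abstract. The extra-class layout, the function names, and the exact loss terms are all assumptions, and refinements used in practice, such as a feature-matching term for the generator, are omitted.

```python
import math

NUM_INTENTS = 30      # BNIntent30 defines 30 intent classes
FAKE = NUM_INTENTS    # assumed index of the extra "generated" class


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_loss(real_logits, real_labels, fake_logits, eps=1e-8):
    """Hypothetical discriminator loss over (NUM_INTENTS + 1)-way logits.

    real_logits: logit vectors for real (BERT-encoded) utterances
    real_labels: gold intent index for each real utterance
    fake_logits: logit vectors for generator-produced representations
    """
    sup = unsup_real = unsup_fake = 0.0
    for logits, label in zip(real_logits, real_labels):
        p = softmax(logits)
        p_real = 1.0 - p[FAKE]
        # Supervised term: probability of the gold intent,
        # renormalized over the real intent classes only.
        sup += -math.log(p[label] / (p_real + eps) + eps)
        # Unsupervised term: a real utterance should not look generated.
        unsup_real += -math.log(p_real + eps)
    for logits in fake_logits:
        # Unsupervised term: a generated sample should land in the fake class.
        unsup_fake += -math.log(softmax(logits)[FAKE] + eps)
    n_real, n_fake = len(real_logits), len(fake_logits)
    return sup / n_real + unsup_real / n_real + unsup_fake / n_fake


def generator_loss(fake_logits, eps=1e-8):
    """The generator is rewarded when its samples are scored as real."""
    losses = [-math.log(1.0 - softmax(l)[FAKE] + eps) for l in fake_logits]
    return sum(losses) / len(losses)
```

In a full model the logit vectors would come from a classification head on top of the BERT encoder, and the generator would map noise vectors into the same representation space. Both are outside the scope of this sketch.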