Bengali Intent Classification with Generative Adversarial BERT

Intent classification is a fundamental task in natural language understanding: it assigns user queries or sentences to predefined classes in order to identify what the user wants. The main challenge lies in covering all relevant intent classes in a dataset while ensuring adequate linguistic variation. Extensive research on this task exists for high-resource languages such as English. In this study, we introduce BNIntent30, a comprehensive Bengali intent classification dataset containing 30 intent classes. The dataset is excerpted and translated from CLINC150, which covers a diverse range of user intents across 150 classes. To evaluate the proposed dataset, we further propose a novel approach to Bengali intent classification based on Generative Adversarial BERT, which we call GAN-BnBERT. Our approach uses BERT-based contextual embeddings to capture salient linguistic features and contextual information from the text, while the generative adversarial network (GAN) component helps the model learn diverse representations of the existing intent classes through generative modeling. Our experimental results show that GAN-BnBERT achieves superior performance on the newly introduced BNIntent30 dataset, outperforming both the existing Bi-LSTM model and the stand-alone BERT-based classifier.
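The abstract does not spell out the training objective, so the following is only a minimal sketch of how a GAN-plus-BERT classifier of this kind is commonly set up, assuming the general GAN-BERT recipe: the discriminator sits on top of the BERT representation and predicts one of the 30 intents or one extra "generated" class. The names `NUM_INTENTS`, `FAKE`, `discriminator_loss`, and `generator_loss` are hypothetical, and the logits are plain Python lists standing in for model outputs.

```python
import math

NUM_INTENTS = 30      # BNIntent30 has 30 intent classes
FAKE = NUM_INTENTS    # assumed extra class index for generated inputs


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_loss(real_logits, real_label, fake_logits):
    """Loss for one labeled real example and one generated example.

    Both logit lists have NUM_INTENTS + 1 entries (assumed layout).
    """
    p_real = softmax(real_logits)
    p_fake = softmax(fake_logits)
    real_mass = 1.0 - p_real[FAKE]  # probability that the input is real
    # Supervised term: cross-entropy restricted to the 30 real intents.
    supervised = -math.log(p_real[real_label] / real_mass)
    # Unsupervised term: real input should look real, generated should look fake.
    unsupervised = -math.log(real_mass) - math.log(p_fake[FAKE])
    return supervised + unsupervised


def generator_loss(fake_logits):
    """The generator is rewarded when its output is not flagged as generated."""
    return -math.log(1.0 - softmax(fake_logits)[FAKE])
```

In this setup the generator never needs labels; it only pushes the discriminator to treat its outputs as real, which is one way an adversarial component can encourage more varied representations of the existing classes. A full implementation would typically add a feature-matching term and batch the computation in a tensor library.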

