Retrieval and Generative Approaches for a Pregnancy Chatbot in Nepali with Stemmed and Non-Stemmed Data: A Comparative Study

From arXiv: 7 pages, 5 figures, 4 tables. In Proceedings of the International Conference on Technologies for Computer, Electrical, Electronics & Communication (ICT-CEEL 2023), Bhaktapur, Nepal.

The field of Natural Language Processing (NLP), which applies artificial intelligence to human language, has seen tremendous growth. Its applications, such as language translation, chatbots, virtual assistants, search autocomplete, and autocorrect, are widely used in domains including healthcare, customer service, and advertising, including targeted advertising. To provide pregnancy-related information, a health-domain chatbot is proposed, and this work explores two NLP-based approaches for developing it. The first is a retrieval approach based on multiclass classification, using multilingual BERT and multilingual DistilBERT; the second is a transformer-based generative chatbot. For each approach, performance is analyzed on both stemmed and non-stemmed Nepali datasets. The experimental results indicate that the pre-trained BERT-based models perform well on non-stemmed data, whereas transformer models trained from scratch perform better on stemmed data. Among the models tested, DistilBERT, used in the retrieval-based architecture on the non-stemmed dataset, achieved the highest training and validation accuracy and a testing accuracy of 0.9165. Similarly, the transformer-based generative architecture achieved 1-gram and 2-gram BLEU scores of 0.3570 and 0.1413, respectively.
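For readers unfamiliar with the reported metric, the following is a minimal sketch of how 1-gram and 2-gram BLEU scores are commonly computed for one generated response against one reference. It assumes individual (not cumulative) n-gram BLEU with a brevity penalty and no smoothing; the abstract does not state which BLEU variant or tokenization the authors used, so this may differ from their setup. The function names and the toy English sentences are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    ref_counts = Counter(ngrams(reference, n))
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(reference, candidate):
    """Penalize candidates that are shorter than the reference."""
    if not candidate:
        return 0.0
    if len(candidate) >= len(reference):
        return 1.0
    return math.exp(1 - len(reference) / len(candidate))


def bleu_n(reference, candidate, n):
    """Individual n-gram BLEU (assumed variant): brevity penalty times clipped precision."""
    return brevity_penalty(reference, candidate) * modified_precision(reference, candidate, n)


# Hypothetical toy example; real inputs would be tokenized Nepali text,
# stemmed or non-stemmed depending on the dataset being evaluated.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
bleu_1 = bleu_n(reference, candidate, 1)
bleu_2 = bleu_n(reference, candidate, 2)
```

In this toy case five of the six candidate unigrams and three of the five candidate bigrams match the reference, which illustrates why a 2-gram score is typically lower than the 1-gram score for the same output.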
