In question answering in NLP, the retrieval of Frequently Asked Questions (FAQs) is an important, well-researched sub-area that has been studied for many languages. In response to a user query, a retrieval system typically returns the relevant FAQs from a knowledge base. The efficacy of such a system depends on its ability to establish a semantic match between the query and the FAQs in real time. The task is challenging because of the inherent lexical gap between queries and FAQs, the lack of sufficient context in FAQ titles, the scarcity of labeled data, and high retrieval latency. In this work, we propose a bi-encoder-based query-FAQ matching model that leverages multiple combinations of FAQ fields (such as question, answer, and category) during both model training and inference. Our proposed Multi-Field Bi-Encoder (MFBE) model benefits from the additional context provided by multiple FAQ fields and performs well even with minimal labeled data. We support this claim empirically through experiments on proprietary as well as open-source public datasets, in both unsupervised and supervised settings. For the FAQ retrieval task, our model achieves around 27% and 20% better top-1 accuracy than the best-performing baseline on internal and open datasets, respectively.
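To make the multi-field idea concrete, the following is a minimal sketch rather than the MFBE implementation. It assumes a toy bag-of-words function `encode` standing in for a trained encoder, and it assumes that an FAQ is scored by the maximum similarity over its field combinations; the abstract does not specify how combinations are aggregated, so both choices, along with all function names and the sample FAQs, are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def encode(text):
    """Toy stand-in for a trained encoder: L2-normalised bag-of-words (assumption)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if not norm:
        return {}
    return {token: c / norm for token, c in counts.items()}


def field_combinations(faq, fields=("question", "answer", "category")):
    """Yield (field names, joined text) for every non-empty combination of fields."""
    for r in range(1, len(fields) + 1):
        for combo in combinations(fields, r):
            yield combo, " ".join(faq[f] for f in combo)


def build_index(faqs):
    """Encode every field combination of every FAQ once, ahead of query time."""
    return [
        (faq_id, combo, encode(text))
        for faq_id, faq in enumerate(faqs)
        for combo, text in field_combinations(faq)
    ]


def retrieve(query, index, top_k=1):
    """Rank FAQs by their best-matching field combination (max is an assumption)."""
    q = encode(query)
    best = {}
    for faq_id, _combo, vec in index:
        # Dot product of unit vectors, i.e. cosine similarity.
        score = sum(weight * vec.get(token, 0.0) for token, weight in q.items())
        best[faq_id] = max(score, best.get(faq_id, -1.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical FAQs, invented purely for illustration.
faqs = [
    {"question": "How do I reset my password?",
     "answer": "Open settings and choose the reset password option.",
     "category": "account"},
    {"question": "Where is my refund?",
     "answer": "Refunds reach your card within five business days.",
     "category": "billing"},
]
index = build_index(faqs)
top = retrieve("refund card days", index)
```

In this sketch the query "refund card days" shares only one word with the second FAQ's question, but the combined question and answer text covers all three query words, so that FAQ is ranked first. This illustrates, in a simplified form, how extra fields can narrow the lexical gap between a query and a short FAQ title. Because the FAQ side is encoded ahead of time, only the query needs encoding at retrieval time, which is the usual latency argument for a bi-encoder.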