In question answering within NLP, the retrieval of Frequently Asked Questions (FAQs) is a well-studied sub-area that has been explored for many languages. Given a user query, a retrieval system typically returns the relevant FAQs from a knowledge base. The efficacy of such a system depends on its ability to establish a semantic match between the query and the FAQs in real time. The task is challenging because of the inherent lexical gap between queries and FAQs, the lack of sufficient context in FAQ titles, the scarcity of labeled data, and high retrieval latency. In this work, we propose a bi-encoder-based query-FAQ matching model that leverages multiple combinations of FAQ fields (such as question, answer, and category) during both model training and inference. Our proposed Multi-Field Bi-Encoder (MFBE) model benefits from the additional context provided by multiple FAQ fields and performs well even with minimal labeled data. We support this claim empirically through experiments on proprietary and publicly available open-source datasets in both unsupervised and supervised settings. Our model achieves around 27% and 20% better top-1 accuracy on the FAQ retrieval task on internal and open datasets, respectively, over the best-performing baseline.
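To make the idea of matching a query against multiple combinations of FAQ fields concrete, the following is a minimal sketch, not the MFBE implementation. It assumes a toy bag-of-words `encode` function standing in for the learned bi-encoder, and it assumes that scores across field combinations are aggregated by taking the maximum; neither choice is stated in the abstract. The names `encode`, `field_combinations`, `index_faqs`, and `retrieve`, as well as the sample FAQs, are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

FIELDS = ("question", "answer", "category")


def encode(text):
    """Toy stand-in for a learned encoder: L2-normalised bag-of-words (assumption)."""
    tokens = [t.strip(".,?!:;()\"'") for t in text.lower().split()]
    counts = Counter(t for t in tokens if t)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b[tok] for tok, w in a.items() if tok in b)


def field_combinations(faq):
    """Concatenate every non-empty subset of the fields present in one FAQ."""
    present = [f for f in FIELDS if faq.get(f)]
    combos = {}
    for r in range(1, len(present) + 1):
        for combo in combinations(present, r):
            combos[combo] = " ".join(faq[f] for f in combo)
    return combos


def index_faqs(faqs):
    """Pre-compute one vector per (FAQ, field combination); this can run offline."""
    return [
        {combo: encode(text) for combo, text in field_combinations(faq).items()}
        for faq in faqs
    ]


def retrieve(query, index, top_k=1):
    """Encode the query once, then score each FAQ by its best field combination."""
    q = encode(query)
    # Max-aggregation over combinations is an illustrative assumption.
    scores = [max(dot(q, v) for v in vectors.values()) for vectors in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_k]]


# Hypothetical sample data, for illustration only.
faqs = [
    {"question": "How do I reset my password?",
     "answer": "Open settings and choose the reset password option.",
     "category": "account"},
    {"question": "Where is my order?",
     "answer": "Track the shipment from the orders page.",
     "category": "delivery"},
]
index = index_faqs(faqs)
best_index, best_score = retrieve("track shipment", index)[0]
print(faqs[best_index]["question"])
```

In this sketch the query "track shipment" shares no words with either FAQ question, so a question-only index would score both FAQs at zero; the match comes from the answer field. That is the kind of lexical gap the additional fields are meant to bridge. Because FAQ vectors are computed ahead of time and only the query is encoded at request time, the sketch also reflects the usual latency argument for bi-encoders.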