Sentence semantic matching is a research hotspot in natural language processing and plays an important role in key scenarios such as community question answering, search, chatbots, and recommendation. Most advanced models directly model the semantic relevance among words across the two sentences while neglecting their \textit{keywords} and \textit{intents}; DC-Match was proposed to disentangle keywords from intents and utilize them to improve matching performance. Although DC-Match is a simple yet effective method for semantic matching, it depends heavily on external NER techniques to identify the keywords of sentences, which limits matching performance for minor languages, since satisfactory NER tools are usually hard to obtain for them. In this paper, we propose to resolve the text into multiple concepts in a general and flexible way for multilingual semantic matching, freeing the model from its reliance on NER models. To this end, we devise a \underline{M}ulti-\underline{C}oncept \underline{P}arsed \underline{S}emantic \underline{M}atching framework based on pre-trained language models, abbreviated as \textbf{MCP-SM}, which extracts various concepts and infuses them into the classification tokens. We conduct comprehensive experiments on the English datasets QQP and MRPC and the Chinese dataset Medical-SM. We also experiment on the Arabic datasets MQ2Q and XNLI, where the outstanding performance further demonstrates MCP-SM's applicability to low-resource languages.