Sentence semantic matching is a research hotspot in natural language processing and plays a significant role in key scenarios such as community question answering, search, chatbots, and recommendation. Because most advanced models directly model the semantic relevance between the words of two sentences while neglecting their \textit{keywords} and \textit{intents}, DC-Match was proposed to disentangle keywords from intents and use them to optimize matching performance. Although DC-Match is a simple yet effective method for semantic matching, it depends heavily on external NER techniques to identify the keywords of sentences, which limits matching performance for minor languages, since satisfactory NER tools for them are usually hard to obtain. In this paper, we propose to resolve the text into multiple concepts in a general and flexible way for multilingual semantic matching, freeing the model from its reliance on NER models. To this end, we devise a \underline{M}ulti-\underline{C}oncept \underline{P}arsed \underline{S}emantic \underline{M}atching framework based on pre-trained language models, abbreviated as \textbf{MCP-SM}, which extracts various concepts and infuses them into the classification tokens. We conduct comprehensive experiments on the English datasets QQP and MRPC and the Chinese dataset Medical-SM. We also experiment on the Arabic datasets MQ2Q and XNLI, where the strong performance further demonstrates MCP-SM's applicability to low-resource languages.