Community question answering (CQA) forums are Internet-based platforms where users ask questions about a topic and other expert users try to provide solutions. Many CQA forums, such as Quora, Stack Overflow, Yahoo! Answers, and Stack Exchange, host large amounts of user-generated data. These data are leveraged in automated CQA ranking systems, which present similar questions (and their answers) in response to a user's query. In this work, we empirically investigate three aspects of this domain. First, in addition to traditional features such as TF-IDF and BM25, we introduce a BERT-based feature that captures the semantic similarity between the question and the answer. Second, most existing research has focused on features extracted only from the question part; features extracted from answers have not been explored extensively. We combine both types of features linearly. Third, using these proposed concepts, we conduct an empirical investigation with different rank-learning algorithms, some of which have not previously been used in the CQA domain. On three standard CQA datasets, our proposed framework achieves state-of-the-art performance. We also analyze the importance of the features used in our investigation. This work is expected to guide practitioners in selecting a better set of features for the CQA retrieval task.
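To make the idea of linearly combining question-side and answer-side features concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single lexical feature (Okapi BM25) computed once against the archived questions and once against their answers. The function names `bm25_scores` and `rank`, the weights `w_q` and `w_a`, and the whitespace tokenizer are all hypothetical choices made for illustration. The BERT-based similarity feature and the rank-learning algorithms mentioned above are omitted; in a learned setting the weights would be fitted rather than fixed by hand.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `query` against each document in `docs`.

    Assumes `docs` is non-empty and contains at least one token overall.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(query, questions, answers, w_q=0.6, w_a=0.4):
    """Rank archived (question, answer) pairs for `query`.

    The final score is a linear combination of a question-side and an
    answer-side BM25 feature. The weights are hypothetical placeholders.
    """
    q_scores = bm25_scores(query, questions)
    a_scores = bm25_scores(query, answers)
    combined = [w_q * q + w_a * a for q, a in zip(q_scores, a_scores)]
    return sorted(range(len(questions)), key=lambda i: combined[i], reverse=True)
```

Calling `rank(query, questions, answers)` returns the indices of the archived pairs, best match first. Adding further features (for example TF-IDF cosine similarity or an embedding-based similarity) would amount to adding more weighted terms to `combined`.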