Community Question Answering (CQA) is growing rapidly across domains, driven by the availability of many platforms and the large volume of information users share on them. As these platforms grow, the sheer amount of archived data makes it difficult for moderators to retrieve possible duplicates of a new question and to identify and confirm existing question pairs as duplicates in a timely manner. The problem is even more critical in CQA platforms for large software systems such as askubuntu, where moderators must be experts to recognize that two questions are duplicates. The prime challenge on such platforms is that the moderators, being experts themselves, are usually extremely busy and their time is very expensive. To assist moderators, this work tackles two significant problems for the askubuntu CQA platform: (1) retrieval of duplicate questions for a new question and (2) prediction of duplicate question confirmation time. In the first task, we retrieve duplicate questions from a question pool for a newly posted question. In the second task, we solve a regression problem in order to rank question pairs that could take a long time to be confirmed as duplicates. For duplicate question retrieval, we propose a Siamese neural network based approach that exploits both text and network-based features and outperforms several state-of-the-art baseline techniques; in particular, it outperforms DupPredictor and DUPE by 5% and 7%, respectively. For duplicate confirmation time prediction, we use both standard machine learning models and a neural network, together with text and graph-based features. We obtain Spearman's rank correlations of 0.20 and 0.213 (statistically significant) for text and graph-based features, respectively.
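The confirmation time task is evaluated with Spearman's rank correlation, which compares the ordering of predicted and observed values rather than their magnitudes. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function names `average_ranks` and `spearman` and the toy confirmation times are assumptions made for illustration only.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the window over all entries tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical confirmation times in hours for five question pairs
    # (toy values, not data from the paper).
    predicted = [3.0, 10.0, 1.5, 48.0, 7.0]
    observed = [2.0, 30.0, 4.0, 40.0, 5.0]
    print(f"Spearman's rho: {spearman(predicted, observed):.3f}")
```

Because only ranks matter, a regressor can score well on this metric even when its absolute time estimates are off, as long as it orders slow-to-confirm pairs above fast ones.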