Duplicate Question Retrieval and Confirmation Time Prediction in Software Communities

Community Question Answering (CQA) is growing rapidly across domains, driven by the availability of many platforms and the large volume of information that users share. As these platforms grow, the sheer amount of archived data makes it difficult for moderators to retrieve possible duplicates of a new question and to identify and confirm existing question pairs as duplicates in a timely manner. The problem is even more critical for CQA platforms built around large software systems, such as askubuntu, where recognizing a duplicate requires expert knowledge. The main challenge on such platforms is that the moderators are themselves experts, so their time is scarce and expensive. To support the moderators, we tackle two significant problems for the askubuntu CQA platform: (1) retrieval of duplicate questions given a new question and (2) prediction of duplicate question confirmation time. In the first task, we retrieve duplicates of a newly posted question from a pool of existing questions. In the second task, we solve a regression problem to rank question pairs that could take a long time to be confirmed as duplicates. For duplicate question retrieval, we propose a Siamese neural network based approach that exploits both text and network-based features and outperforms several state-of-the-art baseline techniques. Our method outperforms DupPredictor and DUPE by 5% and 7% respectively. For duplicate confirmation time prediction, we use both standard machine learning models and a neural network, together with text and graph-based features. We obtain Spearman's rank correlations of 0.20 and 0.213 (statistically significant) for text and graph-based features respectively.
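The confirmation time results above are reported as Spearman's rank correlation, which compares the ordering of predicted and observed values rather than their magnitudes. The following is a minimal sketch of that metric in plain Python, not the evaluation code used in this work; the function names (`average_ranks`, `pearson`, `spearman`) and the choice to average ranks over ties are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(observed))
```

Calling `spearman(predicted_times, observed_times)` on a model's predicted confirmation times and the observed ones gives a value in [-1, 1]; a value of 1 means the model orders the question pairs exactly as the observed confirmation times do, even if the predicted magnitudes are off.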
