Duplicate Question Retrieval and Confirmation Time Prediction in Software Communities

Community Question Answering (CQA) is growing rapidly across domains, driven by the availability of many platforms and the large volume of information that users share. As these platforms grow, the sheer amount of archived data makes it difficult for moderators to retrieve possible duplicates of a new question, and to identify and confirm existing question pairs as duplicates in a timely manner. The problem is even more acute in CQA platforms for large software systems such as askubuntu, where moderators need expert knowledge to recognize a duplicate. The central challenge on such platforms is that the moderators are themselves experts, so they are usually extremely busy and their time is very expensive. To ease the moderators' task, we tackle two significant problems for the askubuntu CQA platform: (1) retrieval of duplicate questions given a new question and (2) duplicate question confirmation time prediction. In the first task, we retrieve duplicates of a newly posted question from a question pool. In the second task, we solve a regression problem to rank question pairs that could potentially take a long time to be confirmed as duplicates. For duplicate question retrieval, we propose a Siamese neural network based approach that exploits both text and network-based features and outperforms several state-of-the-art baseline techniques; in particular, it outperforms DupPredictor and DUPE by 5% and 7% respectively. For duplicate confirmation time prediction, we use both standard machine learning models and a neural network, with text and graph-based features. We obtain Spearman's rank correlations of 0.20 and 0.213 (statistically significant) for text and graph-based features respectively.
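
To make the evaluation measure for the second task concrete, the following is a minimal sketch of Spearman's rank correlation between predicted and observed confirmation times. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names `average_ranks` and `spearman` and the sample values (in hours) are hypothetical, and ties are handled by average ranks, which is one common convention.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical confirmation times in hours, for illustration only.
    predicted = [2.0, 30.5, 7.0, 120.0, 48.0]
    observed = [5.0, 12.0, 9.0, 200.0, 20.0]
    print(f"Spearman's rho: {spearman(predicted, observed):.3f}")
```

Because the measure depends only on ranks, it rewards a model for ordering question pairs correctly by confirmation time even when the predicted times themselves are far from the observed ones, which matches the ranking goal of the second task.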
