In Duplicate Bug Report Detection (DBRD), conventional methods focus on statically analyzing bug databases and often disregard the running time of the model. Complex models can reach high accuracy but are time-consuming, while more efficient models may sacrifice accuracy. To address this trade-off, we propose a transformer-based system designed to balance time efficiency and accuracy. Existing methods treat DBRD as either a retrieval task or a classification task; our hybrid approach combines the strengths of both. The retrieval model performs an initial ranking that reduces the candidate set, and the classification model then makes a more precise duplicate decision on the remaining candidates. In our assessment of commonly used models, Sentence-BERT and RoBERTa outperform the other baseline models in retrieval and classification, respectively. To evaluate both performance and efficiency, we conduct experiments on five public datasets. The results show that our system maintains accuracy comparable to that of a classification model while being significantly faster, and that it is only slightly slower than a retrieval model, thereby achieving an effective trade-off between accuracy and efficiency.
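The retrieve-then-classify idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two scoring functions below are deliberately simple stand-ins (bag-of-words cosine similarity in place of a Sentence-BERT retriever, token-set Jaccard overlap in place of a RoBERTa pair classifier), and the names `find_duplicates`, `top_k`, and `threshold` are hypothetical. The point is only the control flow: a cheap scorer ranks the whole corpus, and the costly scorer runs on the top `top_k` candidates alone.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (illustrative only)."""
    return text.lower().split()


def cheap_similarity(a: str, b: str) -> float:
    """Stand-in for the retrieval model: cosine similarity of token counts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expensive_score(a: str, b: str) -> float:
    """Stand-in for the pair classifier: Jaccard overlap of token sets."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def find_duplicates(
    query: str,
    corpus: List[str],
    top_k: int = 2,
    threshold: float = 0.5,
    retrieve: Callable[[str, str], float] = cheap_similarity,
    classify: Callable[[str, str], float] = expensive_score,
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs judged to be duplicates of query."""
    # Stage 1: rank every report with the cheap scorer and keep the top_k.
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: retrieve(query, corpus[i]),
        reverse=True,
    )
    candidates = ranked[:top_k]
    # Stage 2: run the costly scorer on the reduced candidate set only.
    scored = [(i, classify(query, corpus[i])) for i in candidates]
    duplicates = [(i, s) for i, s in scored if s >= threshold]
    return sorted(duplicates, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    reports = [
        "app crashes when saving a file",
        "login button does not respond",
        "crash on saving file in editor",
        "dark mode colors are wrong",
    ]
    print(find_duplicates("app crashes when saving file", reports))
```

In this sketch the second stage makes `top_k` calls per query instead of one per stored report, which is the source of the time savings the abstract describes. Choosing `top_k` is the trade-off knob: a larger value costs more classifier calls but lowers the risk that the retriever discards a true duplicate.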