Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists of grouping similar reports through clustering techniques. However, when bug reports are collected as a real-time stream, systems must quickly answer the question "What are the most similar bugs to a new one?", that is, efficiently find near-duplicates. It is thus natural to tackle this problem with nearest-neighbor search, and especially with the well-known locality-sensitive hashing (LSH), which handles large datasets thanks to its sublinear performance and its theoretical guarantees on similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. Indeed, it is not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently, we study in this paper how to leverage LSH for this task. To be able to consider the most relevant metrics used in the literature, we introduce DeepLSH, a Siamese DNN architecture with an original loss function that perfectly approximates the locality-sensitive property, even for the Jaccard and Cosine metrics, for which exact LSH solutions exist. We support this claim with a series of experiments on an original dataset, which we make available.
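To make the locality-sensitive property concrete, the following minimal sketch shows classical MinHash, one of the exact LSH schemes for the Jaccard metric that the abstract alludes to. It is not DeepLSH and is not taken from the paper. The representation of a crash report as a set of frame names, the helper names (`_h`, `minhash_signature`, `estimated_jaccard`, `jaccard`), and the sample traces are all illustrative assumptions. The idea is that two sets agree on a given min-hash with probability equal to their Jaccard similarity, so the fraction of matching signature positions estimates that similarity.

```python
import hashlib


def _h(i, item):
    """i-th hash function (hypothetical helper): salt the item with i, hash with SHA-1."""
    digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=256):
    """MinHash signature of a non-empty set: the minimum of each hash function over the set."""
    items = set(items)
    return [min(_h(i, x) for x in items) for i in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures collide; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Assumed toy data: each crash report reduced to a set of frame names.
trace_a = {"main", "parse", "read_token", "raise_error"}
trace_b = {"main", "parse", "read_token", "log_error"}

sig_a = minhash_signature(trace_a)
sig_b = minhash_signature(trace_b)

print("exact Jaccard:    ", jaccard(trace_a, trace_b))
print("MinHash estimate: ", estimated_jaccard(sig_a, sig_b))
```

In a full LSH index, such signatures would typically be split into bands and hashed into buckets, so that only reports sharing a bucket are compared; that step is omitted here for brevity.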