Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios

In large-scale software systems, errors often arrive without fully-fledged bug reports or human-written descriptions. In such cases, developers rely on stack traces, i.e., the series of function calls that led to the error. Since tens or even hundreds of thousands of stack traces from different users can describe the same issue, they must be automatically deduplicated into categories before they can be processed. Recent works have proposed powerful deep learning-based approaches for this task, but these approaches are evaluated and compared in isolation from real-life workflows, and it is not clear whether they will actually work well at scale. To close this gap, this work presents three main contributions: a novel model, an industry-based dataset, and a multi-faceted evaluation. Our model consists of two parts: (1) an embedding model with byte-pair encoding and approximate nearest neighbor search, which quickly finds the stack traces most relevant to the incoming one, and (2) a reranker that re-ranks these candidates, taking into account the frames repeated between them. To complement the existing datasets collected from open-source projects, we share with the community SlowOps, a dataset of stack traces from IntelliJ-based products developed by JetBrains, which has an order of magnitude more stack traces per category. Finally, we carry out an evaluation that strives to be realistic: it measures not only the accuracy of categorization, but also the operation time and the ability to create new categories. The evaluation shows that our model strikes a good balance: it outperforms other models on both the open-source datasets and SlowOps, while also being faster than most. We release all of our code and data, and hope that our work can pave the way to further practice-oriented research in the area.
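
The two-stage design described above (retrieve candidates, then rerank them, then decide whether to open a new category) can be illustrated with a minimal Python sketch. It is not the released implementation: the hashed bag-of-tokens vector stands in for the learned embedding with byte-pair encoding, the exact cosine search stands in for approximate nearest neighbor search, and the multiset frame overlap stands in for the learned reranker. All names (`embed`, `retrieve`, `frame_overlap`, `assign`, `build_index`) and the values of `DIM`, `k`, and `threshold` are hypothetical.

```python
import math
import re
import zlib
from collections import Counter

DIM = 64  # hypothetical embedding size, chosen only for illustration


def tokenize(frame):
    # Crude stand-in for byte-pair encoding: split a frame into name parts.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", frame) if t]


def embed(trace):
    # Hashed bag-of-tokens vector, L2-normalised (stand-in for a learned embedding).
    vec = [0.0] * DIM
    for frame in trace:
        for tok in tokenize(frame):
            vec[zlib.crc32(tok.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(labelled_traces):
    # Each index entry is (embedding, trace, category).
    return [(embed(trace), trace, cat) for trace, cat in labelled_traces]


def retrieve(query_vec, index, k):
    # Exact cosine search; a production system would use an approximate index.
    scored = [
        (sum(a * b for a, b in zip(query_vec, vec)), i)
        for i, (vec, _, _) in enumerate(index)
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


def frame_overlap(a, b):
    # Multiset Jaccard over frames, so repeated frames are counted.
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return inter / union if union else 0.0


def assign(trace, index, k=5, threshold=0.5):
    """Return (category, score); category is None when a new one should be created."""
    if not index:
        return None, 0.0
    candidates = retrieve(embed(trace), index, k)
    best_score, best_cat = max(
        (frame_overlap(trace, index[i][1]), index[i][2]) for i in candidates
    )
    if best_score < threshold:
        return None, best_score
    return best_cat, best_score
```

Calling `assign(trace, index)` on an index built with `build_index` returns the category of the best reranked candidate, or `None` when no candidate reaches the threshold, which corresponds to the "create a new category" case that the evaluation measures. The threshold is the main design knob in this sketch: raising it creates more new categories, lowering it merges more aggressively.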
