Finding Duplicates in 1.1M BDD Steps: cukereuse, a Paraphrase-Robust Static Detector for Cucumber and Gherkin

arXiv preprint; 39 pages, 9 figures, 8 tables. Under review at Software Quality Journal. Tool, corpus, labelled benchmark, and rubric released at https://github.com/amughalbscs16/cukereuse-release under Apache-2.0.

Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose maintenance cost is established in prior work. Existing detection techniques either require running the tests (Binamungu et al., 2018-2023) or are confined to a single organisation (Irshad et al., 2020-2022), leaving a gap: a purely static, paraphrase-robust, step-level detector usable on any repository. We fill the gap with cukereuse, an open-source Python CLI that combines exact hashing, Levenshtein ratio, and sentence-transformer embeddings in a layered pipeline. It is released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2 %; the median-repository rate is 58.6 %, and the per-repository rate correlates with size (Spearman rho = 0.51). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter-annotator Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score-free second-pass relabelling. The strongest honest pair-level number is the near-exact F1 = 0.822 on score-free labels; the primary-rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC-style and NiCad-style) reach primary F1 = 0.761 and 0.799, respectively. The paper also presents a critique of Gherkin structured by the Cognitive Dimensions of Notations (CDN) framework; eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.
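
To make the layered idea concrete, the following is a minimal sketch of the first two layers named in the abstract (exact hashing, then Levenshtein ratio). It is not the cukereuse implementation: the function names, the keyword-stripping normalisation, the ratio definition (1 - distance / longer length), and the 0.85 threshold are all assumptions chosen for illustration. The embedding layer is left out because it would require a sentence-transformer model.

```python
import hashlib
import re
from collections import defaultdict


def normalise(step: str) -> str:
    """Assumed normalisation: drop the leading Gherkin keyword, lower-case, collapse spaces."""
    text = re.sub(r"^\s*(given|when|then|and|but)\b", "", step, flags=re.IGNORECASE)
    return " ".join(text.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; this particular definition is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def find_duplicates(steps, near_threshold=0.85):
    """Return (exact_groups, near_pairs) as lists of step indices.

    Layer 1 buckets steps by the hash of their normalised text.
    Layer 2 compares one representative per bucket, so exact duplicates
    are never re-compared. The threshold is hypothetical.
    """
    buckets = defaultdict(list)
    for index, step in enumerate(steps):
        digest = hashlib.sha256(normalise(step).encode("utf-8")).hexdigest()
        buckets[digest].append(index)
    exact_groups = [group for group in buckets.values() if len(group) > 1]

    representatives = [group[0] for group in buckets.values()]
    near_pairs = []
    for x in range(len(representatives)):
        for y in range(x + 1, len(representatives)):
            left, right = representatives[x], representatives[y]
            ratio = levenshtein_ratio(normalise(steps[left]), normalise(steps[right]))
            if ratio >= near_threshold:
                near_pairs.append((left, right, ratio))
    return exact_groups, near_pairs


if __name__ == "__main__":
    sample = [
        "Given the user is logged in",
        "And the user is logged in",
        "When the user is logged on",
        "Then the cart is empty",
    ]
    print(find_duplicates(sample))
```

The second layer here is quadratic in the number of distinct steps, which is acceptable for a sketch but would need blocking or indexing at the scale of the corpus described above.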
