Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose maintenance cost is established in prior work. Existing detection techniques either require running the tests (Binamungu et al., 2018-2023) or are confined to a single organisation (Irshad et al., 2020-2022), leaving a gap: a purely static, paraphrase-robust, step-level detector usable on any repository. We fill the gap with cukereuse, an open-source Python CLI that combines exact hashing, Levenshtein ratio, and sentence-transformer embeddings in a layered pipeline, released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2 %; the median-repository rate is 58.6 % (Spearman rho = 0.51 with repository size). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter-annotator Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score-free second-pass relabelling. The strongest pair-level figure we consider trustworthy is near-exact matching, at F1 = 0.822 on score-free labels; the primary-rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC-style, NiCad-style) reach primary-rubric F1 = 0.761 and 0.799, respectively. The paper also presents a critique of Gherkin structured by the Cognitive Dimensions of Notations (CDN) framework; eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.
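To make the layered idea concrete, the first two layers named in the abstract (exact hashing, then Levenshtein ratio) can be sketched as below. This is a minimal illustration and not cukereuse's actual implementation: the function names, the whitespace-and-case normalisation, the SHA-1 digest, the ratio definition (1 minus edit distance over the longer length), and the 0.9 threshold are all assumptions made for the example. The sentence-transformer layer is omitted so that the sketch stays within the standard library.

```python
import hashlib
from collections import defaultdict


def normalise(step: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalisation)."""
    return " ".join(step.split()).lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; this normalisation is one of several in use."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_groups(steps):
    """Layer 1: bucket steps by the hash of their normalised text."""
    groups = defaultdict(list)
    for step in steps:
        digest = hashlib.sha1(normalise(step).encode("utf-8")).hexdigest()
        groups[digest].append(step)
    return list(groups.values())


def near_exact_pairs(groups, threshold=0.9):
    """Layer 2: compare one representative per exact group (hypothetical threshold)."""
    reps = [normalise(group[0]) for group in groups]
    pairs = []
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            score = levenshtein_ratio(reps[i], reps[j])
            if score >= threshold:
                pairs.append((reps[i], reps[j], score))
    return pairs


steps = [
    "the user is logged in",
    "The user is  logged in",
    "the user is logged on",
    "I click the submit button",
]
groups = exact_groups(steps)
pairs = near_exact_pairs(groups, threshold=0.9)
```

Running the cheap exact layer first means the quadratic Levenshtein comparison only sees one representative per distinct step text, which matters when, as reported above, most step occurrences are exact duplicates.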