Reducing Maintenance Burden in Behaviour-Driven Development: A Paraphrase-Robust Duplicate-Step Detector with a 1.1M-Step Open Benchmark

arXiv preprint. 25 pages, 2 figures, 4 tables. Submitted to Information and Software Technology (Elsevier). Tool, corpus, labelled benchmark, and rubric are released at https://github.com/amughalbscs16/cukereuse-release under Apache-2.0.

Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it.

Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 25010 maintainability sub-characteristics.

Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein, sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% CIs under the primary rubric and a score-free relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines.

Results. Step-weighted exact-duplicate rate is 80.2%; median-repository rate is 58.6% (Spearman rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. Near-exact reaches F1 = 0.822 on score-free labels; semantic F1 = 0.906 under the primary rubric reflects a disclosed stratification artefact. Lexical baselines reach F1 = 0.761 and 0.799. The savings model estimates 893,357 corpus-wide eliminable step occurrences; on the median repository 62.5% of step lines are eliminable.
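To make the first two detector layers concrete, the following is a minimal sketch of exact hashing followed by normalised Levenshtein similarity. It is not the released tool's implementation: the normalisation rules (lower-casing, stripping the leading Gherkin keyword, collapsing whitespace), the function names, and the 0.85 threshold are all assumptions chosen for illustration, and the sentence-transformer and hybrid layers are omitted.

```python
import hashlib
import re

# Assumed normalisation: the abstract does not specify the tool's actual rules.
KEYWORD = re.compile(r"^(given|when|then|and|but)\s+")


def normalise(step: str) -> str:
    """Lower-case, drop the leading Gherkin keyword, collapse whitespace."""
    text = KEYWORD.sub("", step.strip().lower())
    return re.sub(r"\s+", " ", text)


def exact_key(step: str) -> str:
    """Layer 1: hash of the normalised text; equal keys mean exact duplicates."""
    return hashlib.sha256(normalise(step).encode("utf-8")).hexdigest()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance using two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lev_similarity(a: str, b: str) -> float:
    """Layer 2: 1 - distance / max length, computed on normalised text."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def classify(a: str, b: str, threshold: float = 0.85) -> str:
    """Hypothetical two-layer decision; the threshold is illustrative only."""
    if exact_key(a) == exact_key(b):
        return "exact"
    if lev_similarity(a, b) >= threshold:
        return "near-exact"
    return "distinct"
```

Under these assumptions, `classify("Given I am on the login page", "And  I am on the Login page")` falls into the exact layer because both steps normalise to the same text, while `"When I click the login button"` and `"When I click on the login button"` differ by three inserted characters and are caught by the Levenshtein layer. A real pipeline would bucket steps by hash first, so that the quadratic pairwise comparison runs only over distinct normalised texts.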
