Context. Behaviour-Driven Development (BDD) suites written in Gherkin accumulate duplicated step text, which carries a documented maintenance cost. Prior detectors either require runnable tests or are limited to a single organisation, leaving a gap: there is no static, paraphrase-robust, step-level detector, and no public benchmark to calibrate one. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model that links duplicate clusters to ISO/IEC 25010 maintainability sub-characteristics. Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, all tagged with SPDX licence identifiers. The detector layers four strategies: exact hashing, normalised Levenshtein distance, sentence-transformer cosine similarity, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (60-pair annotation overlap, Fleiss' kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% CIs under both the primary rubric and a score-free relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines. Results. The step-weighted exact-duplicate rate is 80.2%; the median-repository rate is 58.6% (Spearman rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. The near-exact strategy reaches F1 = 0.822 on the score-free labels; the semantic strategy's F1 = 0.906 under the primary rubric reflects a disclosed stratification artefact. The lexical baselines reach F1 = 0.761 and 0.799, respectively. The savings model estimates 893,357 eliminable step occurrences corpus-wide; on the median repository, 62.5% of step lines are eliminable.
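As an illustration of the first two detector layers named in the Method, the following is a minimal Python sketch of exact hashing and normalised Levenshtein similarity over step text. It is not the released detector: the normalisation rule (dropping the leading Gherkin keyword, collapsing whitespace, lowercasing) and the function names `normalise`, `exact_clusters`, and `normalised_similarity` are assumptions made for this example, and no decision threshold from the paper is implied.

```python
import hashlib
from collections import defaultdict

# Assumed keyword list for illustration; real Gherkin also allows "*" and localised keywords.
GHERKIN_KEYWORDS = ("Given", "When", "Then", "And", "But")


def normalise(step: str) -> str:
    """Collapse whitespace, drop a leading Gherkin keyword, and lowercase (assumed rule)."""
    text = " ".join(step.split())
    for keyword in GHERKIN_KEYWORDS:
        if text.startswith(keyword + " "):
            text = text[len(keyword) + 1:]
            break
    return text.lower()


def exact_clusters(steps):
    """Group steps whose normalised text hashes to the same SHA-256 digest."""
    buckets = defaultdict(list)
    for step in steps:
        digest = hashlib.sha256(normalise(step).encode("utf-8")).hexdigest()
        buckets[digest].append(step)
    # Only groups with at least two members count as duplicates.
    return [group for group in buckets.values() if len(group) > 1]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """Return 1 - distance / max length on normalised text, a value in [0, 1]."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    sample = [
        "Given I am logged in",
        "And  I am logged in",
        "When I click the save button",
    ]
    print(exact_clusters(sample))
    print(normalised_similarity(sample[0], "Given I am logged in as admin"))
```

In this sketch, exact hashing catches only steps that are identical after normalisation, while the similarity score ranks near-exact pairs; a paraphrase-robust layer such as sentence-embedding cosine similarity would need a separate model and is not shown here.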