Context. Behaviour-Driven Development (BDD) suites written in Gherkin accumulate step-text duplication, which carries a documented maintenance cost. Prior detectors either require runnable tests or are limited to a single organisation, leaving a gap: no static, paraphrase-robust, step-level detector exists, nor a public benchmark to calibrate one. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model that links clusters to ISO/IEC 25010 maintainability sub-characteristics. Method. The SPDX-tagged corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps. The detector layers exact hashing, normalised Levenshtein distance, sentence-transformer cosine similarity, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (inter-annotator agreement on a 60-pair overlap: Fleiss' kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% CIs under both the primary rubric and a score-free relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines. Results. The step-weighted exact-duplicate rate is 80.2%; the median-repository rate is 58.6% (Spearman rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. The near-exact strategy reaches F1 = 0.822 on score-free labels; the semantic strategy's F1 = 0.906 under the primary rubric reflects a stratification artefact, which we disclose. Lexical baselines reach F1 = 0.761 and 0.799. The savings model estimates 893,357 eliminable step occurrences corpus-wide; on the median repository, 62.5% of step lines are eliminable.
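As an illustration of the first two detector layers named above (exact hashing and normalised Levenshtein), a minimal sketch follows. It is not the released detector: the normalisation rules (dropping the leading Gherkin keyword, lower-casing, collapsing whitespace), the function names, and the 0.85 similarity threshold are all assumptions chosen for the example.

```python
import hashlib
import re


def normalise(step: str) -> str:
    """Assumed normalisation: drop the leading Gherkin keyword,
    collapse whitespace, and lower-case the text."""
    text = re.sub(r"^\s*(given|when|then|and|but)\s+", "", step.strip(),
                  flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).lower()


def exact_key(step: str) -> str:
    """Hash of the normalised step; equal keys mean an exact duplicate."""
    return hashlib.sha256(normalise(step).encode("utf-8")).hexdigest()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalised Levenshtein similarity in [0, 1] on normalised steps."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def classify(a: str, b: str, threshold: float = 0.85) -> str:
    """Layered check: exact hash first, then the near-exact band.
    The threshold is a hypothetical value, not a calibrated one."""
    if exact_key(a) == exact_key(b):
        return "exact"
    if similarity(a, b) >= threshold:
        return "near-exact"
    return "distinct"
```

Under these assumptions, `classify("Given I am logged in", "And  I am logged in")` falls in the exact layer because both steps normalise to the same text, while a pair differing by a single character lands in the near-exact band. The semantic and hybrid layers would need a sentence-embedding model and are left out of this sketch.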