Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

From arXiv. 28 pages, 2 figures, 4 tables. Submitted to Information and Software Technology (Elsevier). Tool, corpus, labelled benchmark, and rubric released at https://github.com/amughalbscs16/cukereuse-release under Apache-2.0.

Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it.

Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 25010 maintainability sub-characteristics.

Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein, sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% CIs under the primary rubric and a score-free relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines.

Results. Step-weighted exact-duplicate rate is 80.2%; median-repository rate is 58.6% (Spearman rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. Near-exact reaches F1 = 0.822 on score-free labels; semantic F1 = 0.906 under the primary rubric reflects a disclosed stratification artefact. Lexical baselines reach F1 = 0.761 and 0.799. The savings model estimates 893,357 corpus-wide eliminable step occurrences; on the median repository 62.5% of step lines are eliminable.
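To make the first two detector layers named in the abstract concrete, the following is a minimal Python sketch of exact hashing and normalised Levenshtein matching over Gherkin step text. It is not the released tool's implementation. The normalisation rule (dropping the leading step keyword, lowercasing, collapsing whitespace), the function names, and the 0.9 threshold are all assumptions made for illustration. The sentence-transformer cosine and hybrid layers are omitted because the abstract does not specify their parameters.

```python
import hashlib
import re
from collections import defaultdict

# English Gherkin step keywords (stripping them is an assumption of this sketch).
KEYWORDS = ("Given", "When", "Then", "And", "But")


def normalise(step: str) -> str:
    """Drop a leading keyword, collapse whitespace, and lowercase (hypothetical rule)."""
    text = step.strip()
    for kw in KEYWORDS:
        if text.startswith(kw + " "):
            text = text[len(kw) + 1:]
            break
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_groups(steps):
    """Layer 1: group step indices whose normalised text hashes to the same digest."""
    groups = defaultdict(list)
    for i, step in enumerate(steps):
        key = hashlib.sha1(normalise(step).encode("utf-8")).hexdigest()
        groups[key].append(i)
    return [idx for idx in groups.values() if len(idx) > 1]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lev_similarity(a: str, b: str) -> float:
    """Normalised similarity: 1 - distance / longer length (1.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def near_exact_pairs(steps, threshold=0.9):
    """Layer 2: pairs of distinct normalised texts at or above the (assumed) threshold."""
    texts = sorted({normalise(s) for s in steps})
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            sim = lev_similarity(texts[i], texts[j])
            if sim >= threshold:
                pairs.append((texts[i], texts[j], sim))
    return pairs
```

The pairwise loop in `near_exact_pairs` is quadratic in the number of distinct step texts, so it only suits small inputs. At the corpus scale reported above, some form of blocking or indexing would be needed, which this sketch does not attempt.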
