We define a multi-scale metric $d_ρ$ on strings by aggregating, over all $n$, the angular distances between $n$-gram count vectors, weighted exponentially by $ρ^n$. We benchmark $d_ρ$ in DBSCAN clustering against edit-distance and $n$-gram baselines, give a linear-time suffix-tree algorithm for evaluating $d_ρ$, prove metric and stability properties (including robustness under tandem-repeat stutters), and characterize its isometries.
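To make the construction concrete, here is a minimal Python sketch of one possible reading of the definition, $d_ρ(x, y) = \sum_{n \ge 1} ρ^n \, θ_n(x, y)$, where $θ_n$ is the angle between the $n$-gram count vectors of $x$ and $y$. The exact normalization, the treatment of strings shorter than $n$ (here: angle $0$ if both count vectors are empty, $π/2$ if exactly one is), and the names `ngram_counts`, `angle`, and `d_rho` are assumptions made for illustration. This is a direct quadratic-time evaluation, not the linear-time suffix-tree algorithm.

```python
import math
from collections import Counter


def ngram_counts(s, n):
    """Count every length-n substring of s (empty Counter if n > len(s))."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def angle(u, v):
    """Angle in radians between two sparse count vectors.

    Assumed convention for empty vectors: 0 if both are empty,
    pi/2 if exactly one is empty.
    """
    if not u and not v:
        return 0.0
    if not u or not v:
        return math.pi / 2
    dot = sum(c * v[g] for g, c in u.items())  # Counter yields 0 for missing keys
    nu2 = sum(c * c for c in u.values())
    nv2 = sum(c * c for c in v.values())
    cos = dot / math.sqrt(nu2 * nv2)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def d_rho(x, y, rho=0.5):
    """Assumed form: sum over n >= 1 of rho**n * angle_n(x, y).

    Terms with n > max(len(x), len(y)) vanish under the convention above,
    so the sum is finite.
    """
    n_max = max(len(x), len(y))
    return sum(
        rho ** n * angle(ngram_counts(x, n), ngram_counts(y, n))
        for n in range(1, n_max + 1)
    )
```

Under these assumptions, `d_rho("a", "b", 0.5)` has a single nonzero term: the two unigram vectors are orthogonal, so the value is $0.5 \cdot π/2 = π/4$. Small $ρ$ emphasizes short $n$-grams, while $ρ$ close to $1$ gives longer shared substrings more influence.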