Smallest Suffixient Sets: Effectiveness, Resilience, and Calculation

A suffixient set is a novel combinatorial object that captures the essential information of repetitive strings in a way that, when paired with a random-access mechanism, supports various forms of pattern matching. In this paper, we study the size $χ$ of the smallest suffixient set as a repetitiveness measure. First, we study its sensitivity to various string operations. We show that $χ$ cannot increase by more than 2 after appending or prepending a character to the string. As a consequence, we are able to give simple linear-time online algorithms to compute smallest suffixient sets. We also show that, although reversing the string can increase $χ$ by an arbitrary $O(n)$ value, $χ(T)/χ(T^R)\le 2$ always holds. We also prove lower and upper bounds for the additive or multiplicative increase of $χ$ after applying arbitrary edit operations or rotating the text. In particular, we show that the additive increase can be as large as $Ω(\sqrt{n})$ for all those operations. Second, we place $χ$ among known repetitiveness measures. In particular, we show that $χ= O(r)$ (where $r$ is the number of runs in the Burrows-Wheeler Transform of the string), that there are string families where $χ=o(v)$ (where $v$ is the size of the smallest lexicographic parse of the string), and that $χ$ is incomparable to almost all reachable measures based on copy-paste mechanisms. In passing, we give precise bounds for $χ$ on some relevant string families, for example $χ\le σ+2$ on episturmian words over alphabets of size $σ$ (e.g., $χ\le 4$ on Fibonacci strings, for which we precisely characterize the only two smallest suffixient sets).
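To make the measure concrete, the following is a minimal brute-force sketch for computing $χ$ on small strings. It is not the linear-time online algorithm mentioned above. It assumes the usual definition from the suffixient-set literature, which the abstract does not restate: a substring $x$ is right-maximal if it is followed by at least two distinct characters in $T$, a right-extension is $xa$ for such an $x$ and a following character $a$, and a set of positions is suffixient if every right-extension is a suffix of some prefix $T[1..j]$ with $j$ in the set. Under that assumption, two right-extensions can share a position only if one is a suffix of the other, so $χ$ equals the number of right-extensions that are not a proper suffix of another right-extension. The function names `right_extensions` and `chi` are hypothetical, and whether a terminator such as `$` is appended to the text is a convention left to the caller.

```python
from __future__ import annotations


def right_extensions(t: str) -> set[str]:
    """Return every string x + a where x is right-maximal in t.

    Assumed definition: x (possibly empty) is right-maximal when it is
    followed by at least two distinct characters somewhere in t.
    """
    followers: dict[str, set[str]] = {}
    n = len(t)
    for i in range(n):
        for j in range(i, n):
            # t[i:j] is a (possibly empty) substring followed by t[j]
            followers.setdefault(t[i:j], set()).add(t[j])
    return {
        x + a
        for x, chars in followers.items()
        if len(chars) >= 2
        for a in chars
    }


def chi(t: str) -> int:
    """Brute-force size of a smallest suffixient set of t.

    Counts the right-extensions that are not a proper suffix of another
    right-extension; one position per such extension is both necessary
    and sufficient under the assumed definition.
    """
    ext = right_extensions(t)
    return sum(
        1
        for e in ext
        if not any(o != e and o.endswith(e) for o in ext)
    )
```

For example, `chi("abaab")` evaluates to 2, and `chi("abaab$")` evaluates to 4 when a terminator is appended by hand. The sketch enumerates all substrings, so it is only suitable for experimenting with short strings, for instance to observe on small inputs how the value changes when one character is appended.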
