Random Wheeler Automata

Wheeler automata were introduced in 2017 as a tool to generalize existing indexing and compression techniques based on the Burrows-Wheeler transform. Intuitively, an automaton is said to be Wheeler if there exists a total order on its states reflecting the co-lexicographic order of the strings labeling the automaton's paths; this property makes it possible to represent the automaton's topology in a constant number of bits per transition and to solve pattern matching queries efficiently on its accepted regular language. Since their introduction, Wheeler automata have been the subject of a prolific line of research, from both the algorithmic and the language-theoretic points of view. A recurring issue in these studies is the lack of large datasets of Wheeler automata on which the developed algorithms and theories can be tested. One possible way to overcome this issue is to generate random Wheeler automata. Motivated by this observation, in this paper we initiate the theoretical study of random Wheeler automata, focusing on the deterministic case (Wheeler DFAs, or WDFAs). We start by extending the Erdős-Rényi random graph model to WDFAs, and proceed by providing an algorithm that generates uniform WDFAs according to this model. Our algorithm generates a uniform WDFA with $n$ states, $m$ transitions, and alphabet cardinality $\sigma$ in $O(m)$ expected time ($O(m\log m)$ worst-case time with high probability) and constant working space for all alphabets of size $\sigma \le m/\ln m$. As a by-product, we also give formulas for the number of distinct WDFAs and obtain that $n\sigma + (n - \sigma) \log \sigma$ bits are necessary and sufficient to encode a WDFA with $n$ states and alphabet of size $\sigma$, up to an additive $\Theta(n)$ term. We present an implementation of our algorithm and show that it is extremely fast in practice, with a throughput of over 8 million transitions per second.
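To make the Wheeler property concrete, the following minimal sketch checks whether a given numbering of the states is a Wheeler order, using the standard axioms from the literature on Wheeler graphs: states without incoming transitions come first; transitions with a smaller label lead to strictly smaller states; and transitions with the same label preserve the order of their source states. The function name `is_wheeler_order` and the edge-list representation are assumptions made for illustration. This is not the paper's generation algorithm, and the quadratic pairwise check is chosen for clarity rather than efficiency.

```python
from itertools import permutations


def is_wheeler_order(n, edges):
    """Check whether the state numbering 0..n-1 is a Wheeler order.

    `edges` is a list of (source, label, target) triples; labels must be
    comparable (for example, single characters). Hypothetical helper for
    illustration only.
    """
    # Mark every state that has at least one incoming transition.
    has_incoming = [False] * n
    for _, _, target in edges:
        has_incoming[target] = True

    # Axiom 1: states with no incoming transitions precede all other states.
    first_with_incoming = next((s for s in range(n) if has_incoming[s]), n)
    if any(not has_incoming[s] for s in range(first_with_incoming, n)):
        return False

    # Axioms 2 and 3, checked on every ordered pair of distinct transitions.
    for (u1, a1, v1), (u2, a2, v2) in permutations(edges, 2):
        # Axiom 2: a smaller label must lead to a strictly smaller target.
        if a1 < a2 and not v1 < v2:
            return False
        # Axiom 3: same label and smaller source imply a target that is
        # not larger.
        if a1 == a2 and u1 < u2 and not v1 <= v2:
            return False
    return True
```

For example, the transitions `[(0, 'a', 1), (0, 'b', 2), (1, 'b', 3), (2, 'b', 3)]` on four states satisfy all three axioms, whereas `[(0, 'a', 2), (0, 'b', 1)]` does not, because the `a`-transition reaches a larger state than the `b`-transition.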

