We propose \emph{regular expression inference (REI)} as a challenge for code/language modelling and for the wider machine learning community. REI is a supervised machine learning (ML) and program synthesis task that poses the problem of finding minimal regular expressions from examples: given two finite sets of strings $P$ and $N$ and a cost function $\text{cost}(\cdot)$, the task is to generate an expression $r$ that accepts all strings in $P$ and rejects all strings in $N$, such that no other expression $r'$ with this property has $\text{cost}(r')<\text{cost}(r)$. REI has several advantages as a challenge problem: (i) regular expressions are well known, widely used, and a natural idealisation of code; (ii) REI's asymptotic worst-case complexity is well understood; (iii) REI has a small number of easy-to-understand parameters (e.g.~the cardinality of $P$ or $N$, the string lengths of the examples, or the cost function), which lets us easily tune the hardness of REI instances; (iv) REI is an unsolved problem for deep-learning-based ML. Recently, an REI solver was implemented on GPUs, using program synthesis techniques. This enabled, for the first time, fast generation of minimal expressions for complex REI instances. Building on this advance, we generate and publish the first large-scale datasets for REI, and devise and evaluate several initial heuristic and machine learning baselines. We invite the community to participate and explore ML methods that learn to solve REI problems. We believe that progress in REI directly translates to progress in code/language modelling.
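To make the task definition concrete, the following is a minimal brute-force sketch of REI. It is not the GPU solver mentioned above. It rests on several illustrative assumptions that the abstract does not fix: a binary alphabet `01`, a regular expression syntax limited to literals, concatenation, union and Kleene star, and a uniform cost function in which every literal and every operator costs 1. The names `exprs`, `wrap` and `solve` are hypothetical. Expressions are enumerated in order of increasing cost, so the first one that is consistent with $P$ and $N$ is minimal under this cost function.

```python
import re
from functools import lru_cache

ALPHABET = "01"  # assumed alphabet, chosen for illustration only


def wrap(r):
    """Parenthesise compound sub-expressions so operator precedence is explicit."""
    return r if len(r) == 1 else f"({r})"


@lru_cache(maxsize=None)
def exprs(cost):
    """Return all expressions of exactly this cost (assumed: each literal/operator costs 1)."""
    if cost < 1:
        return ()
    if cost == 1:
        return tuple(ALPHABET)
    out = []
    # Kleene star: one operator plus an operand of cost - 1.
    for r in exprs(cost - 1):
        out.append(wrap(r) + "*")
    # Concatenation and union: one operator plus two operands.
    for left in range(1, cost - 1):
        right = cost - 1 - left
        for a in exprs(left):
            for b in exprs(right):
                out.append(wrap(a) + wrap(b))
                out.append(wrap(a) + "|" + wrap(b))
    return tuple(out)


def solve(P, N, max_cost=6):
    """Return (expression, cost) for a minimal consistent expression, or None."""
    for cost in range(1, max_cost + 1):
        for r in exprs(cost):
            pat = re.compile(r)
            accepts_all_p = all(pat.fullmatch(s) for s in P)
            rejects_all_n = not any(pat.fullmatch(s) for s in N)
            if accepts_all_p and rejects_all_n:
                return r, cost
    return None


if __name__ == "__main__":
    print(solve(["0", "00", "000"], ["", "1", "01"]))
```

For $P=\{0, 00, 000\}$ and $N=\{\varepsilon, 1, 01\}$, this search returns `0(0*)` with cost 4 under the assumed cost function. The number of candidate expressions grows exponentially with the cost bound, so exhaustive enumeration of this kind quickly becomes infeasible as instances grow.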