Regular expressions (regexes) are a powerful mechanism for solving string-matching problems. They are supported by all modern programming languages and have been estimated to appear in more than a third of Python and JavaScript projects. Yet existing studies have focused mostly on one aspect of regex programming: readability. We know little about how developers perceive and program regexes, or about the difficulties they face. In this paper, we provide the first study of the regex development cycle, with a focus on (1) how developers make decisions throughout the process, (2) what difficulties they face, and (3) how aware they are of the serious risks involved in programming regexes. We took a mixed-methods approach, surveying 279 professional developers from diverse backgrounds (including top tech firms) for a high-level perspective, and interviewing 17 developers to learn in detail about the difficulties they face and the solutions they prefer. In brief, regexes are hard. Our participants said that regexes are not only hard to read but also hard to search for, hard to validate, and hard to document. They are also hard to master: the majority of the developers we studied were unaware of critical security risks that can occur when using regexes, and those who knew of the risks did not deal with them effectively. Our findings have multiple implications for future work, including semantic regex search engines for regex reuse and improved input generators for regex validation.