Background: Ad hoc parsers are pieces of code that, in effect, perform parsing using common string functions such as split, trim, or slice. Whether handling command-line arguments, reading configuration files, parsing custom file formats, or carrying out any number of other minor string-processing tasks, ad hoc parsing is ubiquitous, yet poorly understood. Objective: This study aims to reveal the common syntactic and semantic characteristics of ad hoc parsing code in real-world Python projects. Our goal is to understand the nature of ad hoc parsers in order to inform future program analysis efforts in this area. Method: We plan to conduct an exploratory study based on large-scale mining of open-source Python repositories from GitHub. We will use program slicing to identify program fragments related to ad hoc parsing, and we will analyze these parsers and their surrounding contexts across 9 research questions using 25 initial syntactic and semantic metrics. Beyond descriptive statistics, we will attempt to identify common parsing patterns using cluster analysis.
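To make the notion of an ad hoc parser concrete, the following is a minimal illustrative sketch, not taken from the study's dataset. The function name `parse_setting` and the `key = value  # comment` line format are hypothetical, chosen only to show slicing, splitting, and stripping working together as an implicit parser. Python's `str.strip` plays the role of "trim" here.

```python
from __future__ import annotations


def parse_setting(line: str) -> tuple[str, int]:
    """Parse a hypothetical 'key = value  # comment' line into (key, int)."""
    # Slice off a trailing comment, if one is present.
    comment_start = line.find("#")
    if comment_start != -1:
        line = line[:comment_start]
    # Split on the first '=' only, then strip surrounding whitespace.
    key, value = line.split("=", 1)
    # int() raises ValueError when the value is not a valid integer literal.
    return key.strip(), int(value.strip())


print(parse_setting("timeout = 30  # seconds"))
```

The accepted input language is never written down as a grammar. It is implied by the sequence of string operations, and malformed input (for example, a line without `=`) surfaces only as a runtime exception.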