Grammar inference for complex programming languages remains a significant challenge, because existing approaches do not scale to real-world datasets within practical time limits. In our experiments, none of the state-of-the-art tools (Arvada, Treevada, and Kedavra) was able to infer grammars for complex languages such as C, C++, and Java within 48 hours. Arvada and Treevada perform grammar inference directly on full-length input examples, which is inefficient for the large files common in such languages. Kedavra introduces data decomposition to create shorter examples for grammar inference, but its lexical analysis still relies on the original inputs, and its strict no-overgeneralization constraint limits the construction of complex grammars. To overcome these limitations, we propose Crucio, which builds a decomposition forest to extract short examples and then performs lexical and grammar inference via a distributional matrix. Experimental results show that Crucio is the only method that successfully infers grammars for complex programming languages (with up to 23x more nonterminals than in prior benchmarks) within reasonable time limits. On the simpler benchmark used in prior work, Crucio improves average recall by 1.37x and 1.19x over Treevada and Kedavra, respectively, and improves F1 scores by 1.21x and 1.13x, respectively.
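The abstract does not describe how the distributional matrix is constructed, so the following is only a minimal sketch of the general idea from distributional learning, not Crucio's actual procedure. It assumes a membership oracle (here a toy balanced-parentheses checker standing in for a real parser), a hand-picked set of short substrings, and a hand-picked set of (left, right) contexts. All function names (`balanced`, `distributional_matrix`, `group_by_row`) and the sample data are hypothetical. Each matrix row records in which contexts a substring is accepted; substrings with identical rows are candidates for sharing a nonterminal.

```python
def balanced(s: str) -> bool:
    """Toy oracle (assumption): accepts strings of balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0


def distributional_matrix(substrings, contexts, oracle):
    """Map each substring to a row of booleans, one per (left, right) context.

    A cell is True when the oracle accepts left + substring + right.
    """
    return {
        sub: tuple(oracle(left + sub + right) for left, right in contexts)
        for sub in substrings
    }


def group_by_row(matrix):
    """Group substrings whose rows are identical (same distribution)."""
    groups = {}
    for sub, row in matrix.items():
        groups.setdefault(row, []).append(sub)
    return list(groups.values())


# Hypothetical short examples and contexts, chosen by hand for illustration.
substrings = ["()", "(())", "()()", "(", ")"]
contexts = [("", ""), ("(", ")"), ("", ")"), ("(", "")]

matrix = distributional_matrix(substrings, contexts, balanced)
groups = group_by_row(matrix)

for group in groups:
    print(group)
```

In this toy setting the three balanced substrings share one row and therefore fall into one group, while the lone opening and closing parentheses each get a distinct row. Short substrings keep the number of oracle queries small, which is one plausible reason to prefer decomposed examples over full-length inputs.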