Translating natural language into Bash commands is an emerging research field that has gained attention in recent years. Most efforts have focused on producing more accurate translation models. To the best of our knowledge, only two datasets are available, one of which is derived from the other. Both were built by collecting data from known sources (such as Stack Overflow and crowdsourcing platforms) and hiring experts to validate and correct either the English text or the Bash commands. This paper makes two contributions to research on synthesizing Bash commands from scratch. First, we describe a state-of-the-art translation model that generates Bash commands from the corresponding English text. Second, we introduce a new NL2CMD dataset that is generated automatically, involves minimal human intervention, and is over six times larger than prior datasets. Because the generation pipeline does not rely on existing Bash commands, the distribution and types of commands can be customized. We also evaluate the performance of ChatGPT on this task and discuss its potential as a data generator. Our empirical results show that the scale and diversity of our dataset offer unique opportunities for semantic parsing researchers.