Translating natural language into Bash commands is an emerging research field that has gained attention in recent years. Most efforts have focused on producing more accurate translation models. To the best of our knowledge, only two datasets are available, one of which is based on the other. Both datasets were built by scraping known data sources (such as Stack Overflow and crowdsourcing platforms) and hiring experts to validate and correct either the English text or the Bash commands. This paper makes two contributions to research on synthesizing Bash commands from scratch. First, we describe a state-of-the-art translation model used to generate Bash commands from the corresponding English text. Second, we introduce a new NL2CMD dataset that is automatically generated, involves minimal human intervention, and is over six times larger than prior datasets. Since the generation pipeline does not rely on existing Bash commands, the distribution and types of commands can be customized. Our empirical results show how the scale and diversity of our dataset can offer unique opportunities for semantic parsing researchers.