Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
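As a rough illustration of the value-grounding idea behind JiraAnchor, the sketch below maps a free-text mention to the closest categorical value known to exist in an instance, then places that value in a JQL clause. It is not the authors' implementation: the abstract describes embedding-based similarity search, whereas this sketch substitutes a toy character-trigram vector for a learned embedding so that it runs with the standard library alone. The function names (`char_ngrams`, `cosine`, `resolve_value`) and the component list are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Toy stand-in for an embedding model: a bag of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resolve_value(mention, known_values, top_k=1):
    """Rank the instance's known categorical values by similarity to the mention."""
    query = char_ngrams(mention)
    scored = [(cosine(query, char_ngrams(value)), value) for value in known_values]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical component names; a real agent would fetch these from the Jira instance.
components = ["Authentication Service", "Billing API", "Search Indexer"]

score, best = resolve_value("the auth service", components)[0]
jql = f'component = "{best}"'
print(jql)
```

In a full agent loop, the resolved value would presumably be checked by executing the resulting query against the live instance; a real system would also replace the trigram vectors with dense embeddings so that paraphrases with no surface overlap can still match.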
