jXBW: A Compressed Index for Structure-Aware JSONL Retrieval in Structured RAG

Providing \textit{structured} information to large language models (LLMs) improves multi-step reasoning and factual grounding, and recent retrieval-augmented generation (RAG) systems therefore reconstruct structure from retrieved text on every query. When the corpus is \emph{already} structured -- as in JSON Lines (JSONL), a popular format for LLM prompts, chemical compounds, and geospatial records -- this per-query rebuilding can be replaced by direct \emph{structural retrieval}. The core primitive is \textit{substructure search}: finding all JSON objects in a collection that contain a given query pattern. Existing approaches index each document separately, so both index space and query time grow with the total collection size; XML-based engines add conversion overhead and semantic mismatches. We propose \textbf{jXBW}, a compressed index for fast substructure search over JSONL, combining three innovations: (i) a merged tree representation that consolidates repeated structures across objects, (ii) a succinct tree index based on the eXtended Burrows--Wheeler Transform (XBW), and (iii) a new three-phase substructure search algorithm that runs on this index. Together they achieve \textbf{query-dependent complexity} in compressed space: search cost is determined by the characteristics of the query rather than by collection size. Experiments on seven real-world datasets, including PubChem ($10^6$ compounds) and OpenStreetMap ($6.6 \times 10^6$ objects), show that jXBW outperforms the strongest tree-based baseline by $\mathbf{16\times}$ on the smallest dataset and by up to $\mathbf{2{,}800\times}$ on the largest, and is more than $\mathbf{2 \times 10^6\times}$ faster than the XQuery engine Saxon. jXBW thus brings structural retrieval over million-record JSONL collections into the sub-millisecond range.
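
To make the substructure search primitive concrete, the following is a minimal reference sketch of what a query computes. It is \emph{not} the jXBW algorithm: it scans every object, which is exactly the collection-size-dependent cost the index is meant to avoid. The matching semantics are assumptions made for illustration (a pattern may match at any depth, object keys must be present with matching values, each pattern array element must match some element of the data array, and scalars must be equal with the same type), and the names \texttt{contains}, \texttt{contains\_anywhere}, and \texttt{substructure\_search} as well as the sample records are hypothetical.

```python
import json
from typing import Any, Iterable, List


def contains(obj: Any, pattern: Any) -> bool:
    """Return True if `pattern` matches starting at the root of `obj`.

    Assumed semantics (illustrative, not taken from the paper):
    - dict pattern: every key must exist in obj with a matching value
    - list pattern: every pattern element must match some element of obj
    - scalar pattern: same type and equal value
    """
    if isinstance(pattern, dict):
        return isinstance(obj, dict) and all(
            key in obj and contains(obj[key], sub) for key, sub in pattern.items()
        )
    if isinstance(pattern, list):
        return isinstance(obj, list) and all(
            any(contains(item, sub) for item in obj) for sub in pattern
        )
    # Type check keeps True and 1 from being treated as the same scalar.
    return type(obj) is type(pattern) and obj == pattern


def contains_anywhere(obj: Any, pattern: Any) -> bool:
    """Return True if `pattern` matches at the root or at any nested node."""
    if contains(obj, pattern):
        return True
    if isinstance(obj, dict):
        return any(contains_anywhere(value, pattern) for value in obj.values())
    if isinstance(obj, list):
        return any(contains_anywhere(item, pattern) for item in obj)
    return False


def substructure_search(lines: Iterable[str], pattern: Any) -> List[int]:
    """Naive linear scan: return 0-based indices of matching JSONL records."""
    hits = []
    for index, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines but keep original line indices
        if contains_anywhere(json.loads(line), pattern):
            hits.append(index)
    return hits


# Hypothetical sample records, one JSON object per line.
JSONL = [
    '{"id": 1, "atoms": [{"el": "C"}, {"el": "O"}]}',
    '{"id": 2, "atoms": [{"el": "N"}]}',
    '{"id": 3, "meta": {"atoms": [{"el": "O", "charge": -1}]}}',
]

query = {"atoms": [{"el": "O"}]}
print(substructure_search(JSONL, query))  # [0, 2]
```

Every query in this sketch parses and traverses all records, so its running time grows linearly with the collection. It is useful mainly as a correctness oracle against which an indexed implementation could be compared.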
