Streaming makes it possible to execute queries over massive JSON or XML documents whose size makes it infeasible to fully parse them into a tree. Earliest query answering is a radical approach to reducing latency and memory footprint. To minimize latency, a document node must be returned as soon as it is guaranteed to be an answer, regardless of how the document ends. Similarly, to minimize memory footprint, a node must be discarded as soon as it is guaranteed not to be an answer, regardless of how the document ends. For simple queries that select nodes based on the path from the root, the decision for each node can be made on the spot, but practical languages such as XPath or JSONPath support filters, which allow selecting nodes based on information collected from various parts of the document, possibly further down the stream. This makes earliest query answering a challenging task, as candidate nodes must be kept in memory until it becomes clear that they can be safely returned or discarded. We show that this can be done for all unary queries expressible in monadic second-order logic (MSO) while ensuring constant update time, provided that nodes are returned by passing a suitable iterator rather than one by one.
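To make the buffering of candidate nodes concrete, the following is a minimal sketch of earliest answering for one fixed filter query, the XPath-style `//a[b]/c` (select every `c` node whose parent is labeled `a` and has a child labeled `b`). It is a hand-written evaluator for this single query, not the MSO construction announced above. The event encoding, the `Frame` class, and the function name `earliest_answers` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event encoding: ("open", label) or ("close", label).
Event = Tuple[str, str]


@dataclass
class Frame:
    """State kept for one currently open node."""
    label: str
    has_b_child: bool = False
    pending_c: List[int] = field(default_factory=list)  # undecided candidates


def earliest_answers(events: Iterable[Event]) -> Iterator[Tuple[int, int]]:
    """Yield (event_position, node_id) for the query //a[b]/c.

    Node ids are assigned in document order, starting at 0. Each answer is
    yielded at the first event after which it is certain to be selected.
    """
    stack: List[Frame] = []
    next_id = 0
    for pos, (kind, label) in enumerate(events):
        if kind == "open":
            node_id = next_id
            next_id += 1
            if stack and stack[-1].label == "a":
                parent = stack[-1]
                if label == "b" and not parent.has_b_child:
                    # The filter [b] is now satisfied for good:
                    # every buffered c child becomes a certain answer.
                    parent.has_b_child = True
                    for candidate in parent.pending_c:
                        yield (pos, candidate)
                    parent.pending_c.clear()
                elif label == "c":
                    if parent.has_b_child:
                        yield (pos, node_id)  # decided on the spot
                    else:
                        parent.pending_c.append(node_id)  # must wait
            stack.append(Frame(label))
        else:
            # Closing a node drops its frame; any candidates still pending
            # under it can no longer become answers and are discarded here.
            stack.pop()


# <r><a><c/><c/><b/><c/></a><a><c/></a></r>
doc: List[Event] = [
    ("open", "r"),
    ("open", "a"),
    ("open", "c"), ("close", "c"),
    ("open", "c"), ("close", "c"),
    ("open", "b"), ("close", "b"),
    ("open", "c"), ("close", "c"),
    ("close", "a"),
    ("open", "a"),
    ("open", "c"), ("close", "c"),
    ("close", "a"),
    ("close", "r"),
]

print(list(earliest_answers(doc)))  # [(6, 2), (6, 3), (8, 5)]
```

In this run the first two `c` nodes are held until the `b` sibling opens at position 6, the third is returned immediately at position 8, and the `c` under the second `a` is dropped when that `a` closes. The flush loop takes time proportional to the number of buffered candidates, which shows in miniature why returning nodes one by one cannot give constant time per event.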