Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited to the streaming setting because it either requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams. In this work, we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models ("proxies") and sampling techniques to limit the execution of an expensive high-precision model (an "oracle") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real time. We theoretically analyze InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluate our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch-setting algorithm.
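The abstract does not spell out InQuest's algorithm, so the following is only a minimal sketch of the general idea it describes: using cheap proxy scores to decide where a limited oracle budget is spent, then combining the oracle's answers into an estimate of an aggregate (here, a mean). Everything specific in it is an assumption made for illustration and is not taken from the paper: the function name `stratified_mean_estimate`, the use of a buffered stream segment, equal-size strata formed by proxy rank, and an equal split of the budget across strata.

```python
import random
from typing import Callable, Sequence, Tuple


def stratified_mean_estimate(
    proxy_scores: Sequence[float],
    oracle: Callable[[int], float],
    budget: int,
    n_strata: int = 3,
    seed: int = 0,
) -> Tuple[float, int]:
    """Estimate the mean oracle value over a buffered stream segment.

    Hypothetical helper (not InQuest's API): records are grouped into
    strata by proxy score, and the oracle runs only on a uniform random
    sample drawn from each stratum.
    """
    n = len(proxy_scores)
    if n == 0:
        raise ValueError("segment is empty")
    if budget < n_strata:
        raise ValueError("need at least one oracle call per stratum")
    rng = random.Random(seed)

    # Order record indices by proxy score, then cut into near-equal strata.
    order = sorted(range(n), key=lambda i: proxy_scores[i])
    bounds = [round(k * n / n_strata) for k in range(n_strata + 1)]
    strata = [order[bounds[k]:bounds[k + 1]] for k in range(n_strata)]
    strata = [s for s in strata if s]  # drop empty strata on tiny segments

    per_stratum = budget // n_strata  # equal allocation (an assumption)
    estimate, calls = 0.0, 0
    for members in strata:
        k = min(per_stratum, len(members))
        sampled = rng.sample(members, k)       # without replacement
        values = [oracle(i) for i in sampled]  # the only oracle invocations
        calls += k
        # Weight each stratum's sample mean by its share of the segment.
        estimate += (len(members) / n) * (sum(values) / k)
    return estimate, calls


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic ground truth and a noisy but informative proxy.
    truth = [1.0 if rng.random() < 0.2 else 0.0 for _ in range(1000)]
    proxy = [0.6 * t + 0.4 * rng.random() for t in truth]
    est, calls = stratified_mean_estimate(proxy, lambda i: truth[i], budget=90)
    print("estimate:", est, "oracle calls:", calls)
    print("true mean:", sum(truth) / len(truth))
```

When the proxy is informative, records within a stratum tend to have similar oracle values, so each stratum's sample mean has low variance and fewer oracle calls are needed for a given error. A streaming system would additionally have to choose strata and allocations without seeing the whole segment in advance, which this sketch does not attempt.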