The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach must scan all the data in order to apply the classification model to every instance in the catalog, which makes it prohibitively expensive for large-scale databases that serve many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can answer such queries rapidly and at low cost without scanning the entire database. It is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which can in turn be executed efficiently by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds on a single server, compared with the hours needed by classical scanning-based approaches.
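The core idea of turning decision-tree inference into range queries can be illustrated with a minimal sketch. It is not the authors' implementation: the toy tree `TREE`, the helper names `predict`, `positive_boxes`, and `range_query`, and the sample points are all hypothetical, and a sorted list searched with `bisect` stands in for a real multidimensional index. Each positive leaf of a tree corresponds to an axis-aligned box, so the set of objects classified as positive equals the union of the results of one range query per positive leaf.

```python
import bisect
import math

# Hypothetical toy tree: internal nodes test x[feature] <= threshold;
# leaves carry a class label (1 = "interesting", 0 = not).
TREE = {
    "feature": 0, "threshold": 5.0,
    "left": {"feature": 1, "threshold": 2.0,
             "left": {"label": 0},
             "right": {"label": 1}},
    "right": {"label": 0},
}

def predict(node, x):
    """Classical inference: walk the tree for a single point."""
    while "label" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

def positive_boxes(node, n_features, box=None):
    """Return one box per positive leaf; each side is a half-open interval (lo, hi]."""
    if box is None:
        box = [(-math.inf, math.inf)] * n_features
    if "label" in node:
        return [box] if node["label"] == 1 else []
    f, t = node["feature"], node["threshold"]
    lo, hi = box[f]
    left = box[:f] + [(lo, min(hi, t))] + box[f + 1:]    # x[f] <= t
    right = box[:f] + [(max(lo, t), hi)] + box[f + 1:]   # x[f] >  t
    return (positive_boxes(node["left"], n_features, left)
            + positive_boxes(node["right"], n_features, right))

def range_query(sorted_points, keys, box):
    """Stand-in for an index lookup: bisect on feature 0, then filter the other dimensions."""
    lo0, hi0 = box[0]
    start = bisect.bisect_right(keys, lo0)  # first position with key > lo0
    stop = bisect.bisect_right(keys, hi0)   # first position with key > hi0
    return [p for p in sorted_points[start:stop]
            if all(lo < v <= hi for v, (lo, hi) in zip(p, box))]

# Made-up sample catalog, sorted once on feature 0 to act as the "index".
points = [(1.0, 1.0), (1.0, 3.0), (4.0, 2.0), (5.0, 2.5), (6.0, 3.0), (7.0, 0.5)]
sorted_points = sorted(points)
keys = [p[0] for p in sorted_points]

boxes = positive_boxes(TREE, n_features=2)
hits = sorted(p for box in boxes for p in range_query(sorted_points, keys, box))
scan = [p for p in sorted_points if predict(TREE, p) == 1]
```

In this sketch, `hits` (range queries) and `scan` (per-object inference) contain the same objects, but the range-query path only touches the slice of the catalog that falls inside each box on the indexed feature. Combining the boxes of several trees into a random-forest vote, and choosing splits so that the boxes match the available indexes, are the parts addressed by the framework itself and are not shown here.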