The vast amounts of data collected in various domains pose major challenges for modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires scanning all the data to apply the classification model to every instance in the catalog, which makes it prohibitively expensive for large-scale databases that serve many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. The framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be executed efficiently by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds on a single server, compared to the hours needed by classical scanning-based approaches.
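The idea of turning decision-tree inference into range queries can be illustrated with a minimal sketch. It assumes a single, already-trained tree with axis-aligned splits, so each leaf corresponds to a box in feature space and the positive leaves become range queries. It does not reproduce the index-aware construction scheme itself. All names (`Node`, `positive_boxes`, `SortedIndex`, `search`) are hypothetical, and the sorted list on one feature is only a stand-in for a real multidimensional index such as a k-d tree.

```python
from bisect import bisect_right
from dataclasses import dataclass
from math import inf
from typing import Optional


@dataclass
class Node:
    # Internal node: go left if x[feature] <= threshold, otherwise right.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None  # set only on leaves (1 = positive)


def predict(node, x):
    """Classical inference: walk the tree for one object."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def positive_boxes(node, n_features, box=None):
    """Return one box per positive leaf.

    A box is a list of (low, high) pairs, one per feature, read as the
    half-open interval low < value <= high.
    """
    if box is None:
        box = [(-inf, inf)] * n_features
    if node.label is not None:
        return [box] if node.label == 1 else []
    low, high = box[node.feature]
    left_box, right_box = box.copy(), box.copy()
    left_box[node.feature] = (low, min(high, node.threshold))
    right_box[node.feature] = (max(low, node.threshold), high)
    return (positive_boxes(node.left, n_features, left_box)
            + positive_boxes(node.right, n_features, right_box))


class SortedIndex:
    """Stand-in index (an assumption): objects sorted by feature 0 only."""

    def __init__(self, points):
        self.points = points
        self.order = sorted(range(len(points)), key=lambda i: points[i][0])
        self.keys = [points[i][0] for i in self.order]

    def range_query(self, box):
        low, high = box[0]
        start = bisect_right(self.keys, low)   # first key > low
        end = bisect_right(self.keys, high)    # first key > high
        # Only candidates inside the feature-0 range are checked further.
        return [i for i in self.order[start:end]
                if all(l < v <= h for v, (l, h) in zip(self.points[i], box))]


def search(tree, index, n_features):
    """Answer the query as a union of range queries, one per positive leaf."""
    hits = set()
    for box in positive_boxes(tree, n_features):
        hits.update(index.range_query(box))
    return sorted(hits)


# Toy data: the tree marks objects with x0 > 0.5 and x1 <= 0.3 as positive.
points = [(0.2, 0.1), (0.7, 0.2), (0.9, 0.8), (0.6, 0.3), (0.5, 0.1)]
tree = Node(feature=0, threshold=0.5,
            left=Node(label=0),
            right=Node(feature=1, threshold=0.3,
                       left=Node(label=1), right=Node(label=0)))
index = SortedIndex(points)

scan_hits = [i for i, p in enumerate(points) if predict(tree, p) == 1]
index_hits = search(tree, index, n_features=2)
print(index_hits, scan_hits)
```

Both lists contain the same object ids, but the index-based path only touches objects whose first feature falls inside a positive box, whereas the scan evaluates the tree on every object. For a random forest, the same extraction would have to be applied per tree and the results combined according to the voting rule, which this sketch does not cover.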