The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires scanning all the data to apply the classification model to every instance in the catalog, which makes it prohibitively expensive for large-scale databases that serve many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can answer such queries rapidly and at low cost without scanning the entire database. It is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which can in turn be executed efficiently using multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds on a single server, compared to the hours needed by classical scanning-based approaches.
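The reduction from tree inference to range queries can be illustrated with a minimal sketch. It is not the index-aware construction scheme proposed in this work. It only shows the underlying observation that every positive leaf of an axis-aligned decision tree corresponds to a hyperrectangle, so the set of objects the tree labels positive is the union of a few box-shaped range queries. The `Node` class, the `positive_boxes`, `in_box`, and `predict` helpers, and the toy tree and points are all hypothetical names and data introduced for illustration. The list comprehension that filters by box stands in for a lookup in a multidimensional index.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A box maps a feature index to a half-open interval (low, high],
# i.e. the constraint low < x[feature] <= high. (Illustrative convention.)
Box = Dict[int, Tuple[float, float]]


@dataclass
class Node:
    """Hypothetical decision-tree node; leaves carry a label, inner nodes a split."""
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None   # branch taken when x[feature] > threshold
    label: Optional[int] = None      # set only for leaves


def positive_boxes(node: Node, box: Optional[Box] = None) -> List[Box]:
    """Collect one box per leaf labelled 1 by narrowing intervals along each path."""
    box = dict(box or {})
    if node.label is not None:
        return [box] if node.label == 1 else []
    lo, hi = box.get(node.feature, (-math.inf, math.inf))
    left_box = {**box, node.feature: (lo, min(hi, node.threshold))}
    right_box = {**box, node.feature: (max(lo, node.threshold), hi)}
    return positive_boxes(node.left, left_box) + positive_boxes(node.right, right_box)


def in_box(x: Tuple[float, ...], box: Box) -> bool:
    """Range predicate; a real system would delegate this to an index."""
    return all(lo < x[f] <= hi for f, (lo, hi) in box.items())


def predict(node: Node, x: Tuple[float, ...]) -> int:
    """Classical per-object inference, used here only as a reference."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


# Toy tree: positive iff x[0] > 0.5 and x[1] <= 2.0.
tree = Node(feature=0, threshold=0.5,
            left=Node(label=0),
            right=Node(feature=1, threshold=2.0,
                       left=Node(label=1), right=Node(label=0)))

boxes = positive_boxes(tree)
points = [(0.2, 1.0), (0.7, 1.5), (0.9, 3.0), (0.6, 2.0)]

# Answering the query via range predicates versus scanning with the model.
hits_by_boxes = [p for p in points if any(in_box(p, b) for b in boxes)]
hits_by_scan = [p for p in points if predict(tree, p) == 1]
```

In this toy setting both routes return the same objects. The difference at scale is that box queries can be answered by an index that touches only the relevant regions of the data, whereas the scan must evaluate the model on every object.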