Data exploration and analysis in many domains require searching for specific objects in massive databases. A common strategy, known as search-by-classification, trains a machine learning model on a small set of positive and negative samples and then runs inference over the entire database to discover additional objects of interest. While this approach often yields very good classification performance, it usually requires scanning the entire database, a process that can easily take several hours even for medium-sized data catalogs. In this work, we present RapidEarth, a geospatial search-by-classification engine that allows analysts to search very large collections of satellite imagery for objects of interest in a matter of seconds, without scanning the entire data catalog. RapidEarth is built on a co-design of multidimensional indexing structures and decision branches, a recently proposed variant of classical decision trees. Decision branches allow RapidEarth to transform the inference phase into a set of range queries, which the multidimensional indexing structures can process efficiently. The main contribution of this work is a geospatial search engine that implements these technical findings.
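To illustrate how inference can become a set of range queries, the following is a minimal sketch, not RapidEarth's actual implementation. It assumes that each decision branch can be written as a conjunction of axis-aligned interval tests, that is, as a box in feature space, and it uses a small hand-written k-d tree as a stand-in for the multidimensional index, which the abstract does not specify. All names (`Box`, `KDNode`, `build`, `range_query`, `search`) and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]
Box = List[Tuple[float, float]]  # assumed form of a branch: one closed interval per feature


def in_box(p: Point, box: Box) -> bool:
    """True if every feature of p lies inside the corresponding interval."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, box))


class KDNode:
    def __init__(self, idx: int, axis: int,
                 left: Optional["KDNode"], right: Optional["KDNode"]) -> None:
        self.idx, self.axis, self.left, self.right = idx, axis, left, right


def build(points: List[Point], ids: List[int], depth: int = 0) -> Optional[KDNode]:
    """Build a k-d tree over the given point ids (stand-in index, built once)."""
    if not ids:
        return None
    axis = depth % len(points[0])
    ids = sorted(ids, key=lambda i: points[i][axis])
    mid = len(ids) // 2
    return KDNode(ids[mid], axis,
                  build(points, ids[:mid], depth + 1),
                  build(points, ids[mid + 1:], depth + 1))


def range_query(node: Optional[KDNode], points: List[Point],
                box: Box, out: List[int]) -> None:
    """Collect ids of points inside box, skipping subtrees that cannot overlap it."""
    if node is None:
        return
    p = points[node.idx]
    if in_box(p, box):
        out.append(node.idx)
    lo, hi = box[node.axis]
    if lo <= p[node.axis]:   # left subtree holds values <= split value
        range_query(node.left, points, box, out)
    if hi >= p[node.axis]:   # right subtree holds values >= split value
        range_query(node.right, points, box, out)


def search(tree: Optional[KDNode], points: List[Point], branches: List[Box]) -> List[int]:
    """Answer a 'classification' as the union of one range query per branch."""
    hits: List[int] = []
    for box in branches:
        range_query(tree, points, box, hits)
    return sorted(set(hits))


# Toy catalog: 100 two-dimensional feature vectors on a 10 x 10 grid.
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
tree = build(points, list(range(len(points))))

# Two hypothetical branches that predict the positive class.
branches = [[(2.0, 4.0), (5.0, 7.0)], [(8.0, 9.0), (0.0, 0.0)]]
result = search(tree, points, branches)

# The index-based answer matches a full scan of the catalog.
full_scan = [i for i, p in enumerate(points) if any(in_box(p, b) for b in branches)]
print(result == full_scan)
```

In this sketch the index is built once, ahead of time, and each new model only contributes a handful of boxes to query. Under the stated assumptions, this is what would let a search avoid touching most of the catalog.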