Machine learning tasks over image databases often generate masks that annotate image content (e.g., saliency maps, segmentation maps) and enable a variety of applications (e.g., determining whether a model is learning spurious correlations or whether an image was maliciously modified to mislead a model). While queries that retrieve examples based on mask properties are valuable to practitioners, existing systems do not support such queries efficiently. In this paper, we formalize the problem and propose a system, MaskSearch, that focuses on accelerating queries over databases of image masks. MaskSearch leverages a novel indexing technique and an efficient filter-verification query execution framework. Experiments on real-world datasets with our prototype show that MaskSearch, using indexes approximately 5% of the size of the data, accelerates individual queries by up to two orders of magnitude and consistently outperforms existing methods on various multi-query workloads that simulate dataset exploration and analysis processes.
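To make the filter-verification idea concrete, the following is a minimal sketch of how such a query could be executed in general. It is not MaskSearch's actual index or algorithm, which the abstract does not specify. The query shape ("masks with more than `min_count` pixels above `threshold` inside a region of interest"), the per-tile min/max index, and all names (`TILE`, `build_index`, `count_bounds`, `query`) are assumptions made for illustration.

```python
import numpy as np

TILE = 4  # hypothetical tile edge length used by this toy index


def build_index(mask):
    """Store per-tile min and max values (assumes height and width divisible by TILE)."""
    h, w = mask.shape
    tiles = mask.reshape(h // TILE, TILE, w // TILE, TILE)
    return tiles.min(axis=(1, 3)), tiles.max(axis=(1, 3))


def count_bounds(index, roi, threshold):
    """Bound the number of ROI pixels > threshold using only the index."""
    tile_min, tile_max = index
    r0, r1, c0, c1 = roi  # half-open region [r0, r1) x [c0, c1)
    lower = upper = 0
    for i in range(r0 // TILE, (r1 - 1) // TILE + 1):
        for j in range(c0 // TILE, (c1 - 1) // TILE + 1):
            # Area of the overlap between tile (i, j) and the ROI.
            rows = min(r1, (i + 1) * TILE) - max(r0, i * TILE)
            cols = min(c1, (j + 1) * TILE) - max(c0, j * TILE)
            area = rows * cols
            if tile_min[i, j] > threshold:
                lower += area  # every pixel in the overlap qualifies
            if tile_max[i, j] > threshold:
                upper += area  # some pixel in the overlap may qualify
    return lower, upper


def query(masks, indexes, roi, threshold, min_count):
    """Return (ids of matching masks, number of masks actually loaded)."""
    r0, r1, c0, c1 = roi
    hits, loaded = [], 0
    for mask_id, index in enumerate(indexes):
        lower, upper = count_bounds(index, roi, threshold)
        if lower > min_count:
            hits.append(mask_id)  # filter stage: accepted without loading
        elif upper > min_count:
            loaded += 1  # verification stage: bounds are inconclusive
            exact = int((masks[mask_id][r0:r1, c0:c1] > threshold).sum())
            if exact > min_count:
                hits.append(mask_id)
        # otherwise the upper bound rules the mask out without loading it
    return hits, loaded
```

In this sketch, a mask is read from storage only when the index bounds cannot decide the predicate, which is the general source of speedup in filter-verification designs; the tile size trades index size against how tight the bounds are.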