Sherlock in OSS: A Novel Approach of Content-Based Searching in Object Storage System

From arXiv: 13 pages, 9 figures. Submitted to IEEE Transactions on Parallel and Distributed Systems for possible publication. arXiv admin note: substantial text overlap with arXiv:1306.3075 by other authors; substantial text overlap with arXiv:1910.05786 by other authors without attribution.

Object Storage Systems (OSS) in the cloud promise scalability, durability, availability, and concurrency. However, open-source OSS offers no dedicated way for users and administrators to search by the data contained in the object storage without involving the entire cloud infrastructure. In this paper, we therefore propose Sherlock, a novel Content-Based Searching (CoBS) architecture that extracts additional information from images and documents. We store this additional information in an Elasticsearch-enabled database, which lets us search for the desired data by its contents. The approach works in two sequential stages. First, uploaded data passes through a classifier that determines its type and routes it to the model for that type: images go to our trained object-detection model, and documents go to keyword extraction. Next, the extracted information is sent to Elasticsearch, which enables content-based search. Because the precision of the models is fundamental to the correctness of the search, we train them on comprehensive datasets (the Microsoft COCO dataset for multimedia data and the SemEval2017 dataset for document data). Furthermore, we test the designed architecture with a real-world implementation on an open-source OSS, OpenStack Swift. We upload images into the dataset of our implementation in various segments to measure the efficacy of the proposed model in real-life Swift object storage.
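The two-stage flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: `classify` guesses the data type from the file extension, `detect_objects` and `extract_keywords` are hypothetical stand-ins for the trained models, and `ContentIndex` is a small in-memory inverted index used in place of Elasticsearch so that the example runs without external services.

```python
import mimetypes
from collections import defaultdict


def detect_objects(data: bytes) -> list[str]:
    """Hypothetical stand-in for the trained object-detection model."""
    return ["dog", "bicycle"]  # placeholder labels, not real model output


def extract_keywords(data: bytes) -> list[str]:
    """Hypothetical stand-in for keyword extraction: keeps longer words."""
    text = data.decode("utf-8", errors="ignore")
    words = (w.strip(".,;:") for w in text.lower().split())
    return sorted({w for w in words if len(w) > 6})


def classify(name: str) -> str:
    """Stage 1: decide the data type (assumed heuristic: file extension)."""
    mime, _ = mimetypes.guess_type(name)
    if mime and mime.startswith("image/"):
        return "image"
    return "document"


class ContentIndex:
    """In-memory inverted index standing in for the Elasticsearch store."""

    def __init__(self) -> None:
        self._terms: dict[str, set[str]] = defaultdict(set)

    def add(self, object_name: str, terms: list[str]) -> None:
        for term in terms:
            self._terms[term].add(object_name)

    def search(self, term: str) -> list[str]:
        # .get avoids creating empty entries for unknown terms
        return sorted(self._terms.get(term.lower(), set()))


def ingest(index: ContentIndex, name: str, data: bytes) -> list[str]:
    """Route the upload to the matching extractor, then index the result."""
    kind = classify(name)
    terms = detect_objects(data) if kind == "image" else extract_keywords(data)
    index.add(name, terms)  # Stage 2: make the object searchable by content
    return terms


index = ContentIndex()
ingest(index, "photo.jpg", b"\xff\xd8 fake image bytes")
ingest(index, "notes.txt", b"Object storage promises scalability and durability.")
print(index.search("dog"))
print(index.search("scalability"))
```

In a real deployment the `ContentIndex` calls would be replaced by requests to an Elasticsearch index, and the two extractor functions by the trained models; the routing logic in `ingest` is the part this sketch is meant to show.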
