Modern large-scale monitoring systems must process and store vast amounts of log data in near real time. At query time, they must find relevant logs based on the content of the log message, using supporting data structures that scale to these data volumes while remaining efficient to use. We present Compressed Probabilistic Retrieval (COPR), a novel algorithm that answers Multi-Set Multi-Membership-Queries and can serve as an alternative to existing indexing structures for streamed log data. In our experiments, COPR required up to 93% less storage space than the tested state-of-the-art inverted index and produced up to four orders of magnitude fewer false positives than the tested state-of-the-art membership sketch. Additionally, COPR achieved up to 250 times higher query throughput than the tested inverted index and up to 240 times higher query throughput than the tested membership sketch.
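To make the query type concrete, the following is a minimal sketch of the kind of question such an index answers, under the assumption that a Multi-Set Multi-Membership-Query asks which of many sets (here, batches of log lines) contain all of several query tokens. It is not COPR: it is an exact inverted-index baseline with no compression and no false positives, and the names `build_inverted_index`, `query_all`, and the sample batches are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def build_inverted_index(batches):
    """Map each whitespace-separated token to the ids of the batches containing it.

    `batches` is a dict: batch id -> list of log lines (illustrative layout,
    not taken from the paper).
    """
    index = defaultdict(set)
    for batch_id, lines in batches.items():
        for line in lines:
            for token in line.split():
                index[token].add(batch_id)
    return index


def query_all(index, tokens):
    """Return the ids of batches that contain every token in `tokens`.

    This is an exact answer; a probabilistic structure would instead return
    a superset that may include false positives.
    """
    postings = [index.get(token, set()) for token in tokens]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical sample data: three batches of streamed log lines.
batches = {
    "batch-0": ["error disk full", "info service started"],
    "batch-1": ["error timeout upstream"],
    "batch-2": ["info disk check ok"],
}
index = build_inverted_index(batches)
matching = query_all(index, ["error", "disk"])
```

In this sketch membership is tested per batch rather than per line, so a batch matches when each token occurs somewhere in it. An exact index of this kind stores every token-to-batch posting, which is the storage cost that the abstract's comparison against an inverted index refers to.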