These are the lecture notes for the course CM0622 - Algorithms for Massive Data, Ca' Foscari University of Venice. The course introduces algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer's memory. Broadly speaking, there are two main approaches to handling such data: (lossless) compressed data structures and (lossy) data sketches. These notes cover the latter: probabilistic filters, sketching under various metrics, Locality Sensitive Hashing, nearest neighbour search, and algorithms on streams (pattern matching, counting).