These are the lecture notes for the course CM0622 - Algorithms for Massive Data, Ca' Foscari University of Venice. The course introduces algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer's memory. There are two main approaches to dealing with such data: (lossless) compressed data structures and (lossy) data sketches. These notes cover both, including compressed suffix arrays, probabilistic filters, sketching under various metrics, Locality Sensitive Hashing, nearest neighbour search, and algorithms on streams.