The blessing of ubiquitous data also comes with a curse: the cost of communicating, storing, and labeling massive, mostly redundant datasets. We seek to solve this problem at its core, using submodular maximization to collect only valuable data and discard the rest. Specifically, we develop algorithms for the online and distributed version of the problem, where data selection occurs in an uncoordinated fashion across multiple data streams. We design a general and flexible core selection routine for our algorithms that, given any stream of data, any assessment of its value, and any formulation of its selection cost, extracts the most valuable subset of the stream up to a constant factor while using minimal memory. Notably, our methods have the same theoretical guarantees as their offline counterparts and, as far as we know, provide the first guarantees for online distributed submodular optimization in the literature. Finally, in learning tasks on ImageNet and MNIST, we show that our selection methods outperform random selection by $5\%$ to $20\%$.
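To make the idea of a streaming selection routine concrete, the following is a minimal sketch, not the routine developed in this work. It assumes a single fixed threshold on marginal value per unit cost, a hard budget on total cost, and a toy coverage function as the value; the names `threshold_select`, `coverage`, `budget`, and `threshold` are hypothetical and chosen only for illustration. No approximation guarantee is claimed for this simplified version.

```python
from typing import Callable, FrozenSet, Iterable, List


def coverage(items: List[FrozenSet[int]]) -> float:
    """Toy submodular value: number of distinct elements covered."""
    covered = set()
    for item in items:
        covered |= item
    return float(len(covered))


def threshold_select(
    stream: Iterable[FrozenSet[int]],
    value: Callable[[List[FrozenSet[int]]], float],
    cost: Callable[[FrozenSet[int]], float],
    budget: float,
    threshold: float,
) -> List[FrozenSet[int]]:
    """One pass over the stream; keep an item only if its marginal
    value per unit cost meets the threshold and the budget allows it.
    Memory use is limited to the selected items (assumed sketch)."""
    selected: List[FrozenSet[int]] = []
    spent = 0.0
    current = value(selected)
    for item in stream:
        c = cost(item)
        if c <= 0 or spent + c > budget:
            continue  # skip items that are free/invalid or unaffordable
        gain = value(selected + [item]) - current
        if gain / c >= threshold:
            selected.append(item)
            spent += c
            current += gain
    return selected


stream = [
    frozenset({1, 2, 3}),
    frozenset({2, 3}),
    frozenset({4}),
    frozenset({1, 4}),
    frozenset({5, 6}),
]
unit_cost = lambda item: 1.0  # assumed: every item costs the same

picked_low = threshold_select(stream, coverage, unit_cost, budget=3, threshold=1.0)
picked_high = threshold_select(stream, coverage, unit_cost, budget=3, threshold=2.0)
print(picked_low)
print(picked_high)
```

In this toy run, redundant items such as `{2, 3}` add no new coverage once `{1, 2, 3}` has been kept, so they are discarded on arrival. Raising the threshold makes the routine more selective, which also drops the low-gain item `{4}`.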