StreamSampling.jl is a Julia library that provides general and efficient methods for sampling from data streams in a single pass, even when the total number of items is unknown. In this paper, we describe the library's capabilities and its advantages over traditional sampling procedures: it maintains a small, constant memory footprint and avoids fully materializing the stream in memory. We also provide empirical benchmarks comparing online sampling methods against standard approaches, demonstrating improvements in both performance and memory usage.
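To make the single-pass, constant-memory idea concrete, the following is a minimal sketch of classic reservoir sampling (Algorithm R), written in Python for illustration only. It is not the API of StreamSampling.jl, and the names `reservoir_sample` and `numbers` are hypothetical; the library's own methods may use different and more efficient algorithms.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Draw k items uniformly without replacement from a stream of unknown length.

    Illustrative Algorithm R sketch (hypothetical helper, not a library function).
    Memory use is O(k) regardless of how many items the stream yields.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def numbers(n: int) -> Iterator[int]:
    """A generator standing in for a stream that is never stored in full."""
    for x in range(n):
        yield x


sample = reservoir_sample(numbers(100_000), k=5, rng=random.Random(42))
short_sample = reservoir_sample(numbers(3), k=5, rng=random.Random(42))
```

Because the generator is consumed one item at a time, only the `k` reservoir slots are ever held in memory. If the stream has fewer than `k` items, the sketch simply returns all of them.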