Larger-than-memory image processing

This report addresses larger-than-memory image analysis for petascale datasets, such as 1.4 PB electron-microscopy volumes and 150 TB human-organ atlases. We argue that performance at this scale is fundamentally I/O-bound, and we show that structuring analysis as streaming passes over the data is crucial. For 3D volumes, two representations are popular: stacks of 2D slices (e.g., directories of images or multi-page TIFF) and 3D chunked layouts (e.g., Zarr/HDF5). Although a few algorithms need a chunked on-disk layout to keep disk I/O to a minimum, we show how a slice-based streaming architecture can be built on top of either representation in a manner that minimizes disk I/O. This is particularly advantageous for algorithms that rely on neighbouring values: a slice stream is one-dimensional, so there are only two possible sweep orders (forward and backward along the stack), and both are aligned with the order in which images are read from disk. In contrast, a sweep over 3D chunks cannot be completed without accessing each chunk at least 9 times. We formalize this with sweep-based execution (natural 2D/3D orders), windowed operations, and overlap-aware tiling that minimizes redundant access. Building on these principles, we introduce a domain-specific language (DSL) that encodes algorithms together with intrinsic knowledge of their optimal streaming pattern and memory use. The DSL performs compile-time and run-time pipeline analyses to automatically select window sizes, fuse stages, tee and zip streams, and schedule passes for limited-RAM machines, yielding near-linear I/O scans and predictable memory footprints. The approach integrates with existing tooling for segmentation and morphology, but reframes pre- and post-processing as pipelines that privilege sequential read/write patterns, delivering substantial throughput gains for extremely large images without requiring the full volume to reside in memory.
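The windowed, single-sweep idea described above can be illustrated with a minimal sketch. It is not the DSL introduced in this report. The names `sliding_windows`, `stream_filter` and `z_max` are hypothetical, and slices are assumed to be plain nested lists so that the example needs only the Python standard library (in practice they would more likely be arrays read from a TIFF stack or a Zarr store). The sketch consumes the slice stream once, front to back, and keeps at most 2r+1 slices in memory for a neighbourhood radius r.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence, TypeVar

S = TypeVar("S")  # one 2D slice (nested lists here; an array in practice)
R = TypeVar("R")  # result computed from one window of slices


def sliding_windows(slices: Iterable[S], radius: int) -> Iterator[List[S]]:
    """Yield, for every slice z, the slices z-radius..z+radius (clipped at the ends).

    The input is consumed exactly once, in order, and at most
    2*radius+1 slices are buffered at any time.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    buf: Deque[S] = deque()
    lo = 0       # z index of buf[0]
    centre = 0   # z index of the next window to emit
    n_read = 0   # number of slices consumed so far

    for s in slices:
        buf.append(s)
        n_read += 1
        # The window centred on `centre` is complete once slice centre+radius is read.
        if n_read - centre > radius:
            yield list(buf)
            centre += 1
            if centre - lo > radius:  # oldest slice is no longer needed
                buf.popleft()
                lo += 1

    # Flush the trailing windows, which are clipped at the end of the stack.
    while centre < n_read:
        yield list(buf)
        centre += 1
        if centre - lo > radius:
            buf.popleft()
            lo += 1


def stream_filter(
    slices: Iterable[S], radius: int, fn: Callable[[List[S]], R]
) -> Iterator[R]:
    """Apply `fn` to each window of the stream, producing one output per input slice."""
    for window in sliding_windows(slices, radius):
        yield fn(window)


def z_max(window: Sequence[Sequence[Sequence[int]]]) -> List[List[int]]:
    """Pixel-wise maximum across the slices of one window."""
    return [[max(px) for px in zip(*rows)] for rows in zip(*window)]
```

Calling `stream_filter(reader, 1, z_max)` on any iterator of slices yields one filtered slice per input slice, so the output can itself be fed into a further streaming stage. Because the buffer is bounded by the window depth rather than by the stack length, the memory footprint stays predictable regardless of how many slices the volume contains.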
