High-resolution images are prevalent in various applications, such as autonomous driving and computer-aided diagnosis. However, training neural networks on such images is computationally challenging and easily leads to out-of-memory errors even on modern GPUs. We propose a simple method, Iterative Patch Selection (IPS), which decouples memory usage from the input size and thus enables the processing of arbitrarily large images under tight hardware constraints. IPS achieves this by selecting only the most salient patches, which are then aggregated into a global representation for image recognition. For both patch selection and aggregation, a cross-attention-based transformer is introduced, which exhibits a close connection to Multiple Instance Learning. Our method demonstrates strong performance and has wide applicability across different domains, training regimes and image sizes while using minimal accelerator memory. For example, we are able to fine-tune our model on whole-slide images consisting of up to 250k patches (>16 gigapixels) with only 5 GB of GPU VRAM at a batch size of 16.
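The selection-then-aggregation idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the names `score`, `select_patches`, `aggregate`, `embed` and `memory_size` are hypothetical, the learned cross-attention transformer is replaced here by a fixed query vector with scaled dot-product scoring, and patch embedding is stubbed with an identity function. What it does show is why memory need not grow with image size: at any moment only the kept patches plus one incoming chunk are held.

```python
import heapq
import math
import random


def score(embedding, query):
    """Scaled dot-product relevance of one patch embedding to the query (assumed scoring rule)."""
    return sum(e * q for e, q in zip(embedding, query)) / math.sqrt(len(query))


def select_patches(chunks, embed, query, memory_size):
    """Stream over chunks of patches, keeping only the memory_size best-scoring embeddings."""
    kept = []
    for chunk in chunks:
        # Candidates are the current buffer plus the newly embedded chunk,
        # so at most memory_size + len(chunk) embeddings exist at once.
        candidates = kept + [embed(patch) for patch in chunk]
        kept = heapq.nlargest(memory_size, candidates, key=lambda e: score(e, query))
    return kept


def aggregate(embeddings, query):
    """Attention pooling: softmax over the scores, then a weighted sum of the embeddings."""
    logits = [score(e, query) for e in embeddings]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [
        sum(w * e[i] for w, e in zip(weights, embeddings)) / total
        for i in range(len(query))
    ]


# Toy usage: 1000 random "patch embeddings" processed 100 at a time.
random.seed(0)
dim = 8
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(dim)]
chunks = (patches[i:i + 100] for i in range(0, len(patches), 100))

selected = select_patches(chunks, lambda patch: patch, query, memory_size=16)
representation = aggregate(selected, query)
```

Because each chunk is compared against the current buffer and only the top `memory_size` candidates survive, the streamed result matches what a single global top-k pass would keep, while the buffer size stays fixed regardless of how many patches the image contains.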