Scalable Batch Correction for Cell Painting via Batch-Dependent Kernels and Adaptive Sampling

Cell Painting is a microscopy-based, high-content imaging assay that produces rich morphological profiles of cells and can support drug discovery by quantifying cellular responses to chemical perturbations. At scale, however, Cell Painting data is strongly affected by batch effects arising from differences in laboratories, instruments, and protocols, which can obscure biological signal. We present BALANS (Batch Alignment via Local Affinities and Subsampling), a scalable batch-correction method that aligns samples across batches by constructing a smoothed affinity matrix from pairwise distances. Given $n$ data points, BALANS builds a sparse affinity matrix $A \in \mathbb{R}^{n \times n}$ using two ideas. (i) For points $i$ and $j$, it sets a local scale using the distance from $i$ to its $k$-th nearest neighbor within the batch of $j$, then computes $A_{ij}$ via a Gaussian kernel calibrated by these batch-aware local scales. (ii) Rather than forming all $n^2$ entries, BALANS uses an adaptive sampling procedure that prioritizes rows with low cumulative neighbor coverage and retains only the strongest affinities per row, yielding a sparse but informative approximation of $A$. We prove that this sampling strategy is order-optimal in sample complexity and provides an approximation guarantee, and we show that BALANS runs in nearly linear time in $n$. Experiments on diverse real-world Cell Painting datasets and controlled large-scale synthetic benchmarks demonstrate that BALANS scales to large collections while improving runtime over native implementations of widely used batch-correction methods, without sacrificing correction quality.
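To make idea (i) concrete, the following is a minimal dense sketch in Python with NumPy, not the BALANS implementation. The abstract does not give the exact kernel, so the sketch assumes a self-tuning form, $A_{ij} = \exp\!\left(-d_{ij}^2 / (\sigma_{i,b(j)}\,\sigma_{j,b(i)})\right)$, where $\sigma_{i,b}$ is the distance from point $i$ to its $k$-th nearest neighbor in batch $b$. The function name `batch_aware_affinity` and the parameters `keep` and `eps` are hypothetical. For idea (ii), the sketch only truncates each row to its strongest entries after computing all $n^2$ distances. It does not reproduce the adaptive sampling procedure, so it does not have the near-linear runtime claimed for BALANS.

```python
import numpy as np

def batch_aware_affinity(X, batches, k=3, keep=5, eps=1e-12):
    """Illustrative dense affinity with batch-aware local scales.

    Assumes every batch contains at least k + 1 points.
    """
    X = np.asarray(X, dtype=float)
    # inv[j] is the column index of the batch that point j belongs to.
    labels, inv = np.unique(np.asarray(batches), return_inverse=True)
    n = X.shape[0]

    # All pairwise Euclidean distances; self-distances are set to inf so a
    # point is never counted as its own neighbor.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)

    # sigma[i, b]: distance from point i to its k-th nearest neighbor in batch b.
    sigma = np.empty((n, len(labels)))
    for col in range(len(labels)):
        members = np.flatnonzero(inv == col)
        sigma[:, col] = np.sort(D[:, members], axis=1)[:, k - 1]

    # S[i, j] = sigma[i, batch(j)]; the kernel form below is an assumption.
    S = sigma[:, inv]
    A = np.exp(-(D ** 2) / (S * S.T + eps))

    # Keep only the `keep` largest affinities in each row (plain truncation).
    top = np.argpartition(-A, keep - 1, axis=1)[:, :keep]
    A_sparse = np.zeros_like(A)
    np.put_along_axis(A_sparse, top, np.take_along_axis(A, top, axis=1), axis=1)
    return A_sparse
```

Because the scale for a pair $(i, j)$ depends on the batch of the other point, a point in a distant batch is measured against how far that batch typically lies from $i$, rather than against the tight neighborhood $i$ has within its own batch. Row-wise truncation leaves the result asymmetric in general; a caller that needs symmetry could average the matrix with its transpose.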