High-dimensional, low-sample-size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ is the number of samples and $m$ is the number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, which makes direct density learning in $\mathbb{R}^m$ ill-conditioned. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block from a shared low-dimensional subunit variable. This concentrates global dependence learning in the compact block-latent space $\mathbb{R}^M$, while decoding to the full feature space uses copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on the block latents, including diffusion models and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and more stable high-dimensional synthetic data than unstructured tabular generators on HDLSS data.
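To make the block-subunit idea concrete, the following is a minimal sketch of the generative direction only, not the authors' implementation. It assumes one scalar latent per block, a standard Gaussian prior standing in for a learned diffusion or flow prior, a one-factor Gaussian copula within each block, log-normal marginals, and a logistic missingness model driven by the block latent. The function name `sample_block_subunit`, the round-robin feature partition, and all parameter values are hypothetical.

```python
import numpy as np

def sample_block_subunit(n, m, M, loading=0.8, miss_rate=0.1, seed=0):
    """Toy block-subunit sampler: n samples, m features, M latent blocks."""
    rng = np.random.default_rng(seed)

    # Hypothetical fixed partition: feature j belongs to block j % M.
    block_of = np.arange(m) % M

    # Block latents in R^M. A standard Gaussian stands in for a learned
    # diffusion or normalizing-flow prior (assumption).
    z = rng.standard_normal((n, M))

    # One-factor Gaussian copula: features in the same block share z[:, b],
    # so they are correlated; features in different blocks are independent.
    # The scaling keeps each copula score g[:, j] at unit variance.
    eps = rng.standard_normal((n, m))
    g = loading * z[:, block_of] + np.sqrt(1.0 - loading**2) * eps

    # Per-feature log-normal marginals (heavy right tail), obtained by
    # transforming the standard-normal copula scores.
    mu = rng.normal(0.0, 1.0, size=m)
    sigma = rng.uniform(0.3, 1.0, size=m)
    x = np.exp(mu + sigma * g)

    # Structured missingness: the probability that an entry is missing
    # increases with the latent of the block it belongs to.
    logit = np.log(miss_rate / (1.0 - miss_rate)) + z[:, block_of]
    p_miss = 1.0 / (1.0 + np.exp(-logit))
    x[rng.random((n, m)) < p_miss] = np.nan

    return x, z, block_of

# HDLSS-shaped draw: 50 samples, 2000 features, 20 blocks.
x, z, block_of = sample_block_subunit(n=50, m=2000, M=20)
print(x.shape, z.shape)
```

In this sketch all cross-feature dependence flows through the $M$-dimensional latent `z`, so a learned prior would only need to model $\mathbb{R}^M$ rather than $\mathbb{R}^m$; the decoder choices (copula family, marginals, missingness model) are illustrative placeholders.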