Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, and these challenges propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address rule explosion in ARM. While they handle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedical domain with around 18k features and 50 samples. Third, we propose two approaches for fine-tuning Aerial+ using tabular foundation models. The proposed approaches significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.