Deep learning has achieved impressive performance in many domains, such as computer vision and natural language processing, but its advantage over classical shallow methods on tabular datasets remains questionable. It is especially challenging to surpass the performance of tree-based ensembles, such as XGBoost or Random Forests, on small datasets (fewer than 1k samples). To tackle this challenge, we introduce HyperTab, a hypernetwork-based approach to solving small-sample problems on tabular datasets. By combining the advantages of Random Forests and neural networks, HyperTab generates an ensemble of neural networks, where each target model is specialized to process a specific lower-dimensional view of the data. Since each view plays the role of data augmentation, we virtually increase the number of training samples while keeping the number of trainable parameters unchanged, which prevents model overfitting. We evaluated HyperTab on more than 40 tabular datasets varying in sample count and domain of origin, and compared its performance with shallow and deep learning models representing the current state of the art. We show that HyperTab consistently outranks other methods on small data (with a statistically significant difference) and performs comparably to them on larger datasets. We make a Python package with the code available for download at https://pypi.org/project/hypertab/
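The idea of one hypernetwork producing an ensemble of target networks, each tied to a lower-dimensional view of the features, can be illustrated with a minimal NumPy sketch. This is not the API of the hypertab package: the names `make_masks`, `HyperNet`, and `predict_proba`, the layer sizes, and the choice of a one-hidden-layer target network are all assumptions made for illustration, and the weights are random and untrained (forward pass only).

```python
import numpy as np


def make_masks(n_features, n_views, view_size, rng):
    """Sample binary masks; each selects `view_size` features (one 'view')."""
    masks = np.zeros((n_views, n_features))
    for row in masks:  # each row is a writable view into `masks`
        row[rng.choice(n_features, size=view_size, replace=False)] = 1.0
    return masks


class HyperNet:
    """Hypothetical hypernetwork: maps a feature mask to target-net weights."""

    def __init__(self, n_features, view_size, hidden, n_classes, rng,
                 hyper_hidden=32):
        # Shapes of the target network parameters: W1, b1, W2, b2.
        self.shapes = [(view_size, hidden), (hidden,),
                       (hidden, n_classes), (n_classes,)]
        n_out = sum(int(np.prod(s)) for s in self.shapes)
        # In this sketch, A and B would be the only trainable parameters.
        self.A = rng.normal(0.0, 0.1, (n_features, hyper_hidden))
        self.B = rng.normal(0.0, 0.1, (hyper_hidden, n_out))

    def target_params(self, mask):
        """Generate and unflatten the target-net weights for one mask."""
        flat = np.tanh(mask @ self.A) @ self.B
        params, start = [], 0
        for shape in self.shapes:
            size = int(np.prod(shape))
            params.append(flat[start:start + size].reshape(shape))
            start += size
        return params


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def predict_proba(hyper, masks, X):
    """Average class probabilities over all views (the ensemble)."""
    probs = []
    for mask in masks:
        W1, b1, W2, b2 = hyper.target_params(mask)
        view = X[:, mask.astype(bool)]          # keep only selected features
        h = np.maximum(view @ W1 + b1, 0.0)     # ReLU hidden layer
        probs.append(softmax(h @ W2 + b2))
    return np.mean(probs, axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))                    # 5 samples, 10 features
masks = make_masks(n_features=10, n_views=8, view_size=4, rng=rng)
hyper = HyperNet(n_features=10, view_size=4, hidden=6, n_classes=3, rng=rng)
proba = predict_proba(hyper, masks, X)
```

In this sketch, adding more views enlarges the ensemble without adding parameters, since every target network is generated from the same matrices `A` and `B`. That mirrors the abstract's claim about keeping the number of trainable parameters unchanged, under the assumptions stated above.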