Modern datasets often contain ballast: redundant or low-utility information that increases dimensionality, storage requirements, and computational cost without contributing meaningful analytical value. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. Across diverse datasets, entropy, mutual information, Lasso, SHAP, PCA, topic modelling, and embedding analysis are applied to identify and eliminate ballast features. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy. Experimental results demonstrate that significant portions of the feature space, often exceeding 70% in sparse or semi-structured data, can be pruned with minimal loss of, or even gains in, classification performance, along with substantial reductions in training time and memory footprint. The framework reveals distinct ballast typologies (e.g., statistical, semantic, infrastructural) and offers practical guidance for leaner, more efficient machine learning pipelines.
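To make the idea of combining signals into a single score concrete, the following is a minimal sketch that uses only two of the signals named in the abstract: entropy and mutual information. The abstract does not give the actual Ballast Score formula, so the function `ballast_score`, its weights, the normalization, and the pruning threshold are all illustrative assumptions rather than the study's method.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    mi = entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
    return max(0.0, mi)  # guard against tiny negative rounding errors


def ballast_score(xs, ys, w_entropy=0.5, w_mi=0.5):
    """Hypothetical score in [0, 1]; higher means more ballast-like.

    Assumption: low feature entropy and low mutual information with the
    target both count as evidence of ballast, combined by a weighted sum.
    """
    n = len(xs)
    h_max = math.log2(n) if n > 1 else 1.0   # upper bound on H(X)
    h_norm = entropy(xs) / h_max
    h_y = entropy(ys)
    mi_norm = mutual_information(xs, ys) / h_y if h_y > 0 else 0.0
    return w_entropy * (1.0 - h_norm) + w_mi * (1.0 - mi_norm)


def prune(features, ys, threshold=0.8):
    """Return the names of features whose score is below the threshold."""
    return [name for name, xs in features.items()
            if ballast_score(xs, ys) < threshold]


# Toy data: a constant column, a column unrelated to the label,
# and a column identical to the label.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
features = {
    "constant": [0] * 8,
    "noise": [0, 0, 1, 1, 0, 0, 1, 1],
    "signal": [0, 1, 0, 1, 0, 1, 0, 1],
}
kept = prune(features, labels)
```

In this toy setup only the `signal` column survives pruning. The sketch handles discrete features only and would not flag high-cardinality identifier columns, which share spuriously high mutual information with any target; a fuller score would need additional signals such as those listed in the abstract.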