In real-world data, information is stored in extremely large feature vectors. The components of these vectors are typically correlated because of complex interactions that involve many features simultaneously. Such correlations correspond qualitatively to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text from their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data and is therefore a proxy for semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality and can therefore be used in big data analysis. We test this approach by identifying phase transitions in model magnetic systems, and we then apply it to the detection of semantic correlations of images and text inside deep neural networks.