In this paper we consider data storage from a probabilistic point of view and obtain bounds for efficient storage in the presence of feature selection and undersampling, both of which are important from a data science perspective. First, we consider the encoding of correlated sources for nonstationary data and obtain a Slepian-Wolf type result for the probability of error. We then reinterpret this result by taking one source to be the set of features to be discarded and the other source to be the remaining data to be encoded. Next, we consider neighbourhood domination in random graphs, where we impose the condition that a fraction of the neighbourhood of each vertex must be present in the set, and obtain optimal bounds on the minimum size of such a set. We show how such sets are useful for undersampling imbalanced datasets and briefly illustrate our result using \(k\)-nearest neighbours type classification rules as an example.