Analyzing very high-volume data streams is an important problem in many application domains. In this paper we present an algorithm that finds clusters and anomalous data points in a streaming dataset, and we demonstrate that it works effectively. Entropy minimization serves as the criterion for defining and updating the clusters formed from the stream. As the clusters form, we also identify anomalous data points that appear far from all known clusters. Using a number of 2-D datasets, we demonstrate the algorithm's effectiveness at discovering clusters and identifying anomalous data points.
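To make the idea concrete, the following is a minimal illustrative sketch, not the algorithm from the paper, whose details the abstract does not give. It assumes each cluster is summarized by an axis-aligned Gaussian with running per-dimension variance, so that its differential entropy has a closed form. A point joins the nearby cluster whose entropy grows least, starts a new cluster if no cluster is nearby, and is flagged as anomalous if it is far from every known cluster. The names `Cluster`, `cluster_stream`, `join_radius`, `anomaly_radius`, and `VAR_FLOOR`, and all threshold values, are hypothetical choices made for this example.

```python
import copy
import math

VAR_FLOOR = 1e-2  # assumed variance floor so tiny clusters have finite entropy


class Cluster:
    """Running per-dimension mean and variance (Welford's update)."""

    def __init__(self, point):
        self.n = 1
        self.mean = list(point)
        self.m2 = [0.0] * len(point)  # sum of squared deviations per dimension

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def entropy(self):
        # Differential entropy of an axis-aligned Gaussian fit to the cluster.
        h = 0.0
        for m2 in self.m2:
            var = max(m2 / self.n, VAR_FLOOR)
            h += 0.5 * math.log(2 * math.pi * math.e * var)
        return h

    def distance(self, point):
        return math.dist(self.mean, point)


def cluster_stream(points, join_radius=2.0, anomaly_radius=20.0):
    """Label each point with a cluster index, or -1 for an anomaly."""
    clusters, labels = [], []
    for p in points:
        dists = [c.distance(p) for c in clusters]
        # Far from every known cluster: flag as anomalous, do not absorb it.
        if clusters and min(dists) > anomaly_radius:
            labels.append(-1)
            continue
        near = [i for i, d in enumerate(dists) if d <= join_radius]
        # No cluster close enough to join: start a new one.
        if not near:
            clusters.append(Cluster(p))
            labels.append(len(clusters) - 1)
            continue

        def entropy_gain(i):
            # Entropy increase if p were added to cluster i.
            trial = copy.deepcopy(clusters[i])
            trial.add(p)
            return trial.entropy() - clusters[i].entropy()

        best = min(near, key=entropy_gain)
        clusters[best].add(p)
        labels.append(best)
    return labels, clusters


stream = [(0, 0), (0.5, 0.2), (10, 10), (10.3, 9.8),
          (0.1, 0.4), (50, 50), (9.9, 10.1)]
labels, clusters = cluster_stream(stream)
print(labels)  # [0, 0, 1, 1, 0, -1, 1]
```

In this toy 2-D stream, two clusters form around (0, 0) and (10, 10), and the point (50, 50) is flagged as anomalous. One simplification to note: a flagged point is discarded rather than kept as a candidate seed for a future cluster, which a practical streaming method would likely need to handle.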