The effectiveness of compression distance in KNN-based text classification ('gzip') has recently attracted considerable attention. In this note we show that simpler means can be just as effective, and that compression may not be needed. Indeed, 'bag-of-words' matching can achieve similar or better results while being more efficient.
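To make the comparison concrete, the following is a minimal sketch of the two kinds of distance that a KNN text classifier could plug in. It is an illustration under stated assumptions, not the exact procedure evaluated in this note: the function names (`ncd`, `bow_distance`, `knn_predict`) are hypothetical, the compression distance is the usual normalized compression distance computed with Python's `gzip` module, and the 'bag-of-words' matching is approximated here by a Jaccard distance over whitespace-separated, lowercased tokens.

```python
import gzip
from collections import Counter


def ncd(a: str, b: str) -> float:
    """Normalized compression distance based on gzip-compressed lengths."""
    ca = len(gzip.compress(a.encode("utf-8")))
    cb = len(gzip.compress(b.encode("utf-8")))
    cab = len(gzip.compress((a + " " + b).encode("utf-8")))
    return (cab - min(ca, cb)) / max(ca, cb)


def bow_distance(a: str, b: str) -> float:
    """Jaccard distance between token sets (an assumed stand-in for
    'bag-of-words' matching; tokenization is a plain whitespace split)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def knn_predict(query, train, distance, k=1):
    """Majority vote among the k training texts closest to the query.

    `train` is a list of (text, label) pairs; `distance` is any function
    taking two strings and returning a float.
    """
    nearest = sorted(train, key=lambda pair: distance(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Both distances can be passed to the same `knn_predict` function, for example `knn_predict(text, train, bow_distance, k=3)`. The bag-of-words variant needs only set operations per pair, whereas the compression variant runs the compressor on every query-training pair, which is one plausible reading of the efficiency claim above.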