Shannon information theory has achieved great success not only in communication technology, for which it was originally developed, but also in many other science and engineering fields such as machine learning and artificial intelligence. Inspired by the well-known TF-IDF weighting scheme, we find that information entropy has a natural dual. We complement classical Shannon information theory by proposing a novel quantity, namely troenpy. Troenpy measures the certainty, commonness, and similarity of the underlying distribution. To demonstrate its usefulness, we propose a troenpy-based weighting scheme for documents with class labels, namely positive class frequency (PCF). On a collection of public datasets, we show that the PCF-based weighting scheme outperforms both classical TF-IDF and a popular optimal-transport-based algorithm, Word Mover's Distance, in a kNN setting. We further develop a new odds-ratio-type feature, namely Expected Class Information Bias (ECIB), which can be regarded as the expected odds ratio of the two information quantities, entropy and troenpy. In the experiments, we observe that adding the new ECIB features to simple binary term features in a plain logistic regression model yields a further significant improvement in performance. The new weighting scheme and the ECIB features are simple, effective, and computable in linear time.