Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are first indexed, and the words in the documents are assigned weights using a weighting technique called TFIDF, which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF is the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents. It is computed by dividing the total number of documents in the system by the number of documents containing the term and then taking the logarithm of the quotient. By default, we use base 10 to calculate the logarithm. In this paper, we test this weighting technique by using a range of log bases from 0.1 to 100.0 to calculate the IDF. We test different log bases for the vector model weighting technique to highlight the importance of understanding how the system performs at different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections, which scientists assembled explicitly for experiments with information retrieval systems.
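The weighting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `idf` and `tfidf_weights`, the toy corpus, and the sampled bases are all hypothetical, and documents are assumed to be already tokenized into lists of terms.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq, base=10.0):
    """IDF = log_base(N / df), with a configurable log base.

    The base must be positive and different from 1, since the
    logarithm is undefined otherwise.
    """
    if base <= 0 or base == 1:
        raise ValueError("log base must be positive and not equal to 1")
    return math.log(num_docs / doc_freq, base)


def tfidf_weights(docs, base=10.0):
    """Return one {term: TF * IDF} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence counts within this document
        weights.append({t: tf[t] * idf(n, df[t], base) for t in tf})
    return weights


# Hypothetical toy corpus, not drawn from any of the test collections.
docs = [["heart", "disease", "heart"], ["wing", "flow"], ["heart", "flow"]]

# Recompute the weights of the first document under several log bases.
for base in (0.5, 2.0, 10.0, 100.0):
    print(base, tfidf_weights(docs, base)[0])
```

Because log_b(x) = ln(x) / ln(b), a base of exactly 1 is undefined, which is why the sketch rejects it. Bases below 1 yield negative IDF values whenever a term does not appear in every document.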