We outline an unsupervised method for the temporal rank ordering of sets of historical documents, applied to two corpora: American State of the Union Addresses and DEEDS, a corpus of medieval English property transfer documents. The method captures the gradual change in word usage through a bandwidth estimate for nonparametric generalized linear models (Fan, Heckman, and Wand, 1995). Because the number of possible rank orders over which the bandwidth-related cost function must be searched grows factorially with the number of documents, it can be very large even for a small set. We treat this as a combinatorial optimization problem and use simulated annealing to search for the optimal temporal order of the documents. On both corpora, our method significantly improved the temporal sequencing relative to a randomly sequenced baseline. This unsupervised approach should enable the temporal ordering of undated document sets.
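As an illustration of the search step only, the following is a minimal sketch of simulated annealing over document orderings. It is not the cost function used in this work: the bandwidth-based criterion is replaced by a simple stand-in, `roughness`, which sums the squared change in word frequency between neighbouring documents. The names `roughness` and `anneal_order`, the swap move, the geometric cooling schedule, and all default parameter values are assumptions made for the example.

```python
import math
import random


def roughness(order, freqs):
    """Stand-in cost (an assumption, not the bandwidth criterion):
    sum of squared word-frequency changes between neighbouring documents."""
    return sum(
        (a - b) ** 2
        for i, j in zip(order, order[1:])
        for a, b in zip(freqs[i], freqs[j])
    )


def anneal_order(freqs, n_steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Search for a low-cost ordering of documents by simulated annealing.

    freqs[d] is the list of word frequencies for document d.
    Returns the best ordering visited and its cost.
    """
    rng = random.Random(seed)
    n = len(freqs)

    # Start from a random ordering of the document indices.
    order = list(range(n))
    rng.shuffle(order)
    cost = roughness(order, freqs)
    best_order, best_cost = order[:], cost

    for step in range(n_steps):
        # Geometric cooling from t_start towards t_end (hypothetical schedule).
        temp = t_start * (t_end / t_start) ** (step / n_steps)

        # Propose a move: swap two distinct positions in the ordering.
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = roughness(order, freqs)
        delta = new_cost - cost

        # Metropolis rule: always accept improvements, and accept a worse
        # ordering with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        else:
            # Rejected: undo the swap.
            order[i], order[j] = order[j], order[i]

    return best_order, best_cost
```

Calling `anneal_order(freqs)` on a list of per-document frequency vectors returns a permutation of the document indices together with its cost. Because this stand-in cost is symmetric, an ordering and its reverse score identically, so the sketch can recover a sequence only up to its direction. Simulated annealing is a heuristic, and the returned ordering is the best one visited rather than a guaranteed global optimum.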