Many computational linguistic methods have been proposed to study the information content of languages. We consider two research questions: 1) how information is distributed over long documents, and 2) how content reduction, such as token selection and text summarization, affects the information density of long documents. We present four criteria for estimating information density in long documents: surprisal, entropy, uniform information density, and lexical density. The first three adopt measures from information theory. We propose an attention-based word selection method for clinical notes and study machine summarization for multiple-domain documents. Our findings reveal systematic differences in the information density of long text across domains. Empirical results on automated medical coding from long clinical notes show the effectiveness of the attention-based word selection method.
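To make the four criteria concrete, the following is a minimal sketch of how each could be computed for a tokenized document. It is not the estimation procedure used in this work. It assumes a unigram model estimated from the document itself, operationalizes uniform information density as the variance of per-token surprisal (lower variance meaning a more uniform distribution), and approximates lexical density with a small hypothetical function-word list in place of a part-of-speech tagger. The names `density_measures` and `FUNCTION_WORDS` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch); a real study would identify content words with a POS tagger.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or",
                  "to", "is", "was", "with", "for"}


def density_measures(tokens):
    """Return four illustrative information density measures for a token list.

    Assumption: token probabilities come from a unigram model estimated on
    the same document, which is far simpler than a neural language model.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")

    total = len(tokens)
    counts = Counter(tokens)
    prob = {word: count / total for word, count in counts.items()}

    # Surprisal of each token in bits: -log2 p(w).
    surprisal = [-math.log2(prob[word]) for word in tokens]
    mean_surprisal = sum(surprisal) / total

    # Shannon entropy of the unigram distribution, in bits.
    entropy = -sum(p * math.log2(p) for p in prob.values())

    # UID proxy: variance of surprisal across token positions.
    uid_variance = sum((s - mean_surprisal) ** 2 for s in surprisal) / total

    # Lexical density: share of tokens that are not function words.
    content = sum(1 for word in tokens if word not in FUNCTION_WORDS)
    lexical_density = content / total

    return {
        "mean_surprisal": mean_surprisal,
        "entropy": entropy,
        "uid_variance": uid_variance,
        "lexical_density": lexical_density,
    }
```

Under this unigram assumption the mean surprisal equals the entropy by construction, so the two measures only diverge when surprisal is estimated with a context-dependent model. Applying the function to a document before and after token selection or summarization gives a simple way to compare how content reduction shifts each measure.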