Entities carry important concepts with concrete meanings and play important roles in numerous linguistic tasks. They take different forms in different tasks, and researchers treat those forms as different concepts. In this paper, we ask whether these different forms of entities share common characteristics. Specifically, we investigate the underlying distributions of entities of different types and in different languages, looking for properties common to these diverse entities. Across twelve datasets covering different entity types and eighteen datasets covering different languages, we find that although these entities differ dramatically from each other in many respects, their length-frequencies are well characterized by Marshall-Olkin power-law (MOPL) distributions, and these distributions have well-defined means and finite variances. Our experiments show that, while not all entities are drawn from the same underlying population, entities of the same type tend to be drawn from the same distribution. They also show that Marshall-Olkin power-law models characterize the length-frequencies of entities much better than pure power-law and log-normal models.
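To make the comparison between a pure power law and a Marshall-Olkin variant concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes that entity length is measured in whitespace-separated tokens, that the base model is a truncated discrete power law, and that the Marshall-Olkin family is obtained through the standard survival-function transform with a tilt parameter `theta`. The function names, the grid-search fit, and the toy entity list are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def length_frequencies(entities):
    """Count entities per length, with length measured in whitespace tokens (an assumption)."""
    return Counter(len(e.split()) for e in entities if e.split())


def power_law_pmf(alpha, max_len):
    """Truncated discrete power law: p(x) proportional to x**-alpha on 1..max_len."""
    weights = [x ** -alpha for x in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def marshall_olkin_pmf(base_pmf, theta):
    """Apply the Marshall-Olkin transform to a discrete base distribution.

    Survival function: G(x) = theta * S(x) / (1 - (1 - theta) * S(x)),
    where S(x) = P(X > x) under the base model. theta = 1 recovers the base.
    """
    pmf, prev_g, cdf = [], 1.0, 0.0
    for p in base_pmf:
        cdf += p
        s = max(0.0, 1.0 - cdf)
        g = theta * s / (1.0 - (1.0 - theta) * s)
        pmf.append(prev_g - g)  # P(X = x) = G(x - 1) - G(x)
        prev_g = g
    return pmf


def log_likelihood(pmf, freqs):
    """Log-likelihood of observed length counts under a pmf indexed from length 1."""
    return sum(n * math.log(pmf[x - 1]) for x, n in freqs.items())


def fit(freqs, thetas):
    """Coarse grid search over alpha and theta; returns (log-likelihood, alpha, theta)."""
    max_len = max(freqs)
    best = None
    for a10 in range(11, 60):
        alpha = a10 / 10
        base = power_law_pmf(alpha, max_len)
        for theta in thetas:
            ll = log_likelihood(marshall_olkin_pmf(base, theta), freqs)
            if best is None or ll > best[0]:
                best = (ll, alpha, theta)
    return best


# Toy, made-up entity mentions; not data from the paper.
entities = ["Paris", "New York", "United Nations",
            "World Health Organization", "Berlin", "Los Angeles"]
freqs = length_frequencies(entities)
pure = fit(freqs, thetas=[1.0])                        # pure power law
mopl = fit(freqs, thetas=[0.25, 0.5, 1.0, 2.0, 4.0])   # Marshall-Olkin variant
print("pure power law:", pure)
print("Marshall-Olkin:", mopl)
```

Because the pure power law is the special case `theta = 1`, the Marshall-Olkin fit in this sketch can never have a lower likelihood than the pure fit. A fair comparison between the two would therefore need a criterion that accounts for the extra parameter, such as a likelihood-ratio test or an information criterion.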