Word differences in news media of lower and higher peace countries revealed by natural language processing and machine learning

Language is both a cause and a consequence of the social processes that lead to conflict or peace. Hate speech can mobilize violence and destruction. What are the characteristics of peace speech that reflect and support the social processes that maintain peace? This study used existing peace indices, machine learning, and online news media sources to identify the words most associated with lower-peace versus higher-peace countries. Because each peace index measures different social properties, there is little consensus on the numerical values of these indices. There is, however, greater consensus among the indices about which countries lie at the extremes of lower peace and higher peace. Therefore, a data-driven approach was used to find the words most important in distinguishing lower-peace from higher-peace countries. Rather than assuming a theoretical framework that predicts which words are more likely in lower-peace and higher-peace countries and then searching for those words in news media, this study used natural language processing and machine learning to identify the words that most accurately classified a country as lower-peace or higher-peace. Once the machine learning model was trained on the word frequencies from the extreme lower-peace and higher-peace countries, it was also used to compute a quantitative peace index for these and for other, intermediate-peace countries. Although the intermediate-peace countries were not in the training set, the model yielded peace index values for them that fell between those of the lower-peace and higher-peace countries. This study demonstrates how natural language processing and machine learning can help generate new quantitative measures of social systems; here, linguistic differences in news media yielded a quantitative index of peace for countries at different levels of peacefulness.
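The pipeline described above (train a classifier on word frequencies from the extreme countries, then reuse its output as a continuous index for countries outside the training set) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the classifier, so plain logistic regression stands in for it; the function names `word_frequencies`, `train_logistic`, and `peace_index` are hypothetical; and the "texts" are synthetic placeholder tokens, not data or words from the study.

```python
import math
from collections import Counter


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def word_frequencies(text, vocab):
    """Relative frequency of each vocabulary word in one country's text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocab]


def train_logistic(X, y, lr=1.0, epochs=2000):
    """Batch gradient descent for logistic regression (assumed stand-in
    classifier; the abstract does not say which model was used)."""
    n = len(X)
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, xi)))
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def peace_index(text, vocab, weights, bias):
    """Predicted probability of 'higher-peace', read as a 0..1 index."""
    x = word_frequencies(text, vocab)
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))


# Synthetic placeholder corpus: tokens and labels are made up.
vocab = ["alpha", "beta", "gamma", "delta"]
training = {
    "high_1": ("alpha alpha beta gamma", 1),
    "high_2": ("alpha beta alpha gamma gamma", 1),
    "low_1": ("delta delta beta gamma", 0),
    "low_2": ("delta beta delta gamma gamma", 0),
}
held_out = {"mid_1": "alpha delta beta gamma alpha delta"}  # not used in training

X = [word_frequencies(text, vocab) for text, _ in training.values()]
y = [label for _, label in training.values()]
weights, bias = train_logistic(X, y)

scores = {name: peace_index(text, vocab, weights, bias)
          for name, (text, _) in training.items()}
scores.update({name: peace_index(text, vocab, weights, bias)
               for name, text in held_out.items()})

for name, score in scores.items():
    print(f"{name}: {score:.3f}")

# Words with the largest absolute weights matter most to the classifier.
ranked = sorted(zip(vocab, weights), key=lambda pair: abs(pair[1]), reverse=True)
print("most discriminative tokens:", [w for w, _ in ranked[:2]])
```

In this toy setup the held-out text mixes tokens typical of both training classes, so its score falls between the two extremes. This mirrors, in miniature, the behaviour the abstract reports for intermediate-peace countries. Ranking tokens by absolute weight is one simple way to read off which words drive the classification; the study's actual importance measure may differ.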
