Local news articles are a subset of news that affects users in a specific geographical area, such as a city, county, or state. Detecting local news (Step 1) and then determining its geographical location and radius of impact (Step 2) are two important steps towards accurate local news recommendation. Naive rule-based methods, such as detecting city names in the news title, tend to give erroneous results because they lack any understanding of the news content. Building on recent advances in natural language processing, we develop an integrated pipeline that enables automatic local news detection and content-based local news recommendation. In this paper, we focus on Step 1 of the pipeline, whose highlights are (1) a weakly supervised framework that incorporates domain knowledge and automated data processing, and (2) scalability to multilingual settings. Compared with the Stanford CoreNLP NER model, our pipeline achieves higher precision and recall when evaluated on a real-world, human-labeled dataset. This pipeline has the potential to deliver more precise local news to users, help local businesses get more exposure, and give people more information about the safety of their neighborhoods.