The ability to automatically identify industry sector coverage in articles on legal developments, or in news articles of any kind, can bring many benefits to both readers and content creators. With articles tagged by industry coverage, readers around the world would be able to find legal news that is specific to their region and professional industry. At the same time, writers would benefit from understanding which industries lack coverage and which industries readers are currently most interested in, and could thus focus their writing efforts on more inclusive and relevant legal news coverage. In this paper, a Machine Learning-powered industry analysis approach that combines Natural Language Processing (NLP) with statistical and Machine Learning (ML) techniques was investigated. A dataset of over 1,700 annotated legal articles was created for the identification of six industry sectors. Text-based and legal-based features were extracted from the articles. Both traditional ML methods (e.g. gradient boosting machine algorithms and decision-tree-based algorithms) and deep neural networks (e.g. transformer models) were applied to compare the performance of the predictive models. The system achieved promising results, with area under the receiver operating characteristic curve (AUC) scores above 0.90 and F-scores above 0.81 for the six industry sectors. The experimental results show that the proposed automated industry analysis, which employs ML techniques, allows large collections of text data to be processed in an easy, efficient, and scalable way. Traditional ML methods performed better than deep neural networks when only a small, domain-specific training dataset was available.
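As an illustration of how the per-sector scores reported above could be computed, the following is a minimal sketch that treats each industry sector as its own binary task and derives the area under the ROC curve and the F-score from predicted scores. The sector names, the toy labels and scores, the 0.5 decision threshold, and all function names are assumptions made for illustration; they are not taken from the paper's dataset or implementation.

```python
from typing import Dict, List, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_per_sector(
    labels: Dict[str, List[int]],
    scores: Dict[str, List[float]],
    threshold: float = 0.5,  # assumed cut-off, not specified in the paper
) -> Dict[str, Dict[str, float]]:
    """Score each sector independently (one binary task per sector)."""
    report = {}
    for sector, y_true in labels.items():
        y_score = scores[sector]
        y_pred = [1 if s >= threshold else 0 for s in y_score]
        report[sector] = {
            "auc": roc_auc(y_true, y_score),
            "f1": f1_score(y_true, y_pred),
        }
    return report


# Hypothetical toy data: two made-up sectors, five articles each.
toy_labels = {"energy": [1, 0, 1, 0, 1], "finance": [0, 0, 1, 1, 0]}
toy_scores = {
    "energy": [0.9, 0.2, 0.6, 0.7, 0.8],
    "finance": [0.1, 0.4, 0.8, 0.3, 0.2],
}

if __name__ == "__main__":
    for sector, metrics in evaluate_per_sector(toy_labels, toy_scores).items():
        print(sector, metrics)
```

Scoring each sector separately makes it visible when one sector lags behind the others, which a single pooled score would hide. In practice a library routine such as scikit-learn's `roc_auc_score` would likely replace the quadratic pairwise loop used here for clarity.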