Automatic Detection of Industry Sectors in Legal Articles Using Machine Learning Approaches

From arXiv: 26 pages, 5 figures, 3 tables. The paper was presented at 'Classification and Data Science in the Digital Age', the 17th conference of the International Federation of Classification Societies (IFCS2022), Porto, Portugal, https://ifcs2022.fep.up.pt/

The ability to automatically identify industry sector coverage in articles on legal developments, or in news articles of any kind, can bring many benefits to both readers and content creators. With articles tagged by industry coverage, readers around the world could find legal news specific to their region and professional industry. Writers, in turn, would see which industries lack coverage or currently attract the most reader interest, and could direct their writing efforts toward more inclusive and relevant legal news coverage. This paper investigates a Machine Learning-powered industry analysis approach that combines Natural Language Processing (NLP) with statistical and Machine Learning (ML) techniques. A dataset of over 1,700 annotated legal articles was created for the identification of six industry sectors. Text-based and legal-based features were extracted from the articles. Both traditional ML methods (e.g. gradient boosting machine algorithms and decision-tree-based algorithms) and deep neural networks (e.g. transformer models) were applied to compare the performance of predictive models. The system achieved promising results, with area under the receiver operating characteristic curve (AUC) scores above 0.90 and F-scores above 0.81 across the six industry sectors. The experimental results show that the proposed automated industry analysis, which employs ML techniques, allows large collections of text data to be processed in an easy, efficient, and scalable way. Traditional ML methods perform better than deep neural networks when only a small amount of domain-specific training data is available.
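The two metrics reported above can be illustrated with a minimal sketch. It assumes that each industry sector is evaluated as its own binary problem (article covers the sector or not), which the abstract suggests but does not spell out. The function names, the toy labels and scores, and the 0.5 decision threshold are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive article is scored
    above a random negative one (ties count as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data for one sector: 1 = article covers the sector.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]  # made-up classifier scores

# Assumed decision threshold of 0.5 to turn scores into hard labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(y_true, scores)
f1 = f1_score(y_true, y_pred)
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```

AUC is computed from the raw scores and does not depend on a threshold, while the F-score depends on the chosen cut-off. This is one reason the two figures are usually reported side by side.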

