Open source software (OSS) licenses regulate the conditions under which users can legally reuse, modify, and distribute software. However, the community has produced a wide variety of OSS licenses, which are written in formal language and are typically long and difficult to understand. In this paper, we conducted an online survey of 661 participants to investigate developers' perspectives on, and practices regarding, OSS licenses. The user study revealed a genuine need for an automated tool to facilitate license understanding. Motivated by the user study and the rapid growth in the number of licenses in the community, we present the first study of automated license summarization. Specifically, we released the first high-quality text summarization dataset and designed two tasks: license text summarization (LTS), which aims to generate a relatively short summary of an arbitrary license, and license term classification (LTC), which infers the attitude of a license towards each of a predefined set of key license terms (e.g., Distribute). To address these two tasks, we present LiSum, a multi-task learning method that helps developers overcome the obstacles to understanding OSS licenses. Comprehensive experiments demonstrated that the proposed joint training objective boosted performance on both tasks, surpassing state-of-the-art baselines by at least 5 points in F1 score on four summarization metrics while simultaneously achieving a 95.13% micro-average F1 score for classification. We released all the datasets, the replication package, and the questionnaires for the community.
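To make the classification metric concrete, the following is a minimal sketch of how a micro-average F1 score can be computed for a task like LTC. It is not the authors' evaluation script: the attitude labels (`"can"`, `"cannot"`, `"must"`) and the toy predictions are assumptions chosen for illustration, since the abstract names neither the label set nor the evaluation code. Micro-averaging pools true positives, false positives, and false negatives over all labels before computing precision and recall.

```python
from collections import Counter


def micro_f1(gold, pred):
    """Micro-averaged F1 for single-label predictions.

    gold, pred: equal-length sequences of attitude labels, one per
    (license, term) pair. Counts are pooled across all labels.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true label but was missed
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    precision = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    recall = total_tp / (total_tp + total_fn) if total_tp + total_fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical attitudes for five (license, term) pairs.
gold = ["can", "can", "cannot", "must", "must"]
pred = ["can", "cannot", "cannot", "must", "can"]
score = micro_f1(gold, pred)
```

When every instance receives exactly one label, as in this sketch, micro-average F1 coincides with plain accuracy; the two diverge only if some instances can carry zero or several labels.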