As mobile applications become more pervasive in everyday life, ethical concerns about them have grown significantly. Users often give feedback, report issues, and suggest new functionality in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and a more varied vocabulary, which makes automated extraction of such reviews challenging and time-consuming. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, with a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Using 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models for extracting potential privacy reviews and compares the results of domain-specific privacy hypotheses with those of generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews as privacy concerns; and 3) uses the best-performing NLI model and LLM to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and that the Llama3.1-8B-Instruct LLM performs best at classifying app reviews. Using NLI+LLM, an additional 1,008 privacy-related reviews were extracted that the keyword-based approach in previous research had not identified, demonstrating the effectiveness of the proposed approach.
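The two-stage NLI+LLM idea can be illustrated with a minimal sketch. It is not the authors' implementation. The hypothesis wording, the 0.8 threshold, the rule that a review passes stage 1 if any hypothesis is entailed, and all function names are assumptions made for illustration. The Hugging Face hub ID in `make_hf_nli_scorer` is likewise an assumption about where a model with the name given above might be found. The toy scorer and judge are crude stand-ins so the control flow can be run without downloading any model.

```python
from typing import Callable, Iterable, List, Sequence

# Hypothetical domain-specific hypotheses; the abstract does not give the actual wording.
PRIVACY_HYPOTHESES = [
    "This review is about the privacy of mental health data.",
    "This review is about sharing personal information with third parties.",
]

NliScorer = Callable[[str, str], float]  # (review, hypothesis) -> entailment probability
LlmJudge = Callable[[str], bool]         # review -> True if judged a privacy concern


def extract_privacy_reviews(
    reviews: Iterable[str],
    nli_score: NliScorer,
    llm_is_privacy: LlmJudge,
    hypotheses: Sequence[str] = PRIVACY_HYPOTHESES,
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Keep reviews that pass the NLI filter and are then confirmed by the LLM."""
    selected = []
    for review in reviews:
        # Stage 1: candidate if at least one hypothesis is entailed strongly enough.
        best = max(nli_score(review, h) for h in hypotheses)
        if best < threshold:
            continue
        # Stage 2: the LLM classifier makes the final decision on the candidate.
        if llm_is_privacy(review):
            selected.append(review)
    return selected


def make_hf_nli_scorer(
    model_name: str = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",  # assumed hub ID
) -> NliScorer:
    """Build an NLI scorer from the transformers zero-shot pipeline (downloads a model)."""
    from transformers import pipeline  # imported lazily; only needed for real scoring

    classifier = pipeline("zero-shot-classification", model=model_name)

    def score(review: str, hypothesis: str) -> float:
        # Template "{}" passes the hypothesis through unchanged; multi_label=True
        # scores entailment against contradiction for this single hypothesis.
        out = classifier(
            review,
            candidate_labels=[hypothesis],
            hypothesis_template="{}",
            multi_label=True,
        )
        return float(out["scores"][0])

    return score


# Toy stand-ins so the pipeline logic can be exercised offline.
def toy_nli_score(review: str, hypothesis: str) -> float:
    return 0.9 if "data" in review.lower() else 0.1


def toy_llm_judge(review: str) -> bool:
    return "shar" in review.lower()
```

Calling `extract_privacy_reviews(reviews, toy_nli_score, toy_llm_judge)` exercises the control flow only. For real use, `make_hf_nli_scorer()` would replace the toy scorer, and an instruction-tuned LLM prompted for a yes/no privacy judgment would replace the toy judge.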