Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence

The increasing capacities of large language models (LLMs) present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks that have typically been allocated to human labor. This contribution proposes a systematic mixed methods framework to harness qualitative analytic expertise, machine scalability, and rigorous quantification, with attention to transparency and replicability. Sixteen machine-assisted case studies are showcased as proof of concept. Tasks include linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film, social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments. LLM (and human) annotations may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies illustrate how tasks that previously required potentially months of team effort and complex computational pipelines can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, this approach is intended not to replace but to augment researcher knowledge and skills. With these opportunities in sight, qualitative expertise and the ability to pose insightful questions have arguably never been more critical.
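The idea of carrying annotation error into downstream statistics can be illustrated with a minimal sketch. It is not the procedure from the paper itself, only one plausible reading of "a bootstrapping approach". It assumes binary (0/1) labels and a single symmetric accuracy value, which would in practice be estimated by comparing LLM output with a human-annotated test set. The function name `bootstrap_share_with_error` and all parameters are hypothetical.

```python
import random


def bootstrap_share_with_error(labels, accuracy, n_boot=2000, seed=0):
    """Percentile interval for the share of positive (1) labels.

    Assumptions (illustrative only): labels are 0/1, and the annotator
    (LLM or human) is correct with the same probability `accuracy`
    for both classes.
    """
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    for _ in range(n_boot):
        # Step 1: resample items with replacement (ordinary bootstrap).
        sample = [labels[rng.randrange(n)] for _ in range(n)]
        # Step 2: flip each label with probability 1 - accuracy, so that
        # annotation uncertainty widens the resulting distribution.
        noisy = [lab if rng.random() < accuracy else 1 - lab for lab in sample]
        estimates.append(sum(noisy) / n)
    estimates.sort()
    # 95% percentile interval over the bootstrap estimates.
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    llm_labels = [1] * 30 + [0] * 70  # made-up annotations, for illustration
    print(bootstrap_share_with_error(llm_labels, accuracy=1.0))
    print(bootstrap_share_with_error(llm_labels, accuracy=0.85))
```

Symmetric flipping also pulls the estimated share toward 0.5, so a fuller treatment would use class-specific error rates from a confusion matrix. The sketch only shows how annotation uncertainty can be propagated rather than ignored.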

