Natural Language Processing is rapidly becoming a primary instrument for Computational Social Science, with researchers increasingly using embeddings to measure latent constructs such as novelty, creativity, and bias. However, this transition faces a fundamental validity challenge: the "Proxy Presumption," the practice of treating geometric properties (e.g., cosine distance) as direct measures of social concepts. We argue that without explicit validation, unsupervised representations remain entangled mixtures of the target construct ($C$) and confounding attributes ($Z$) such as topic, style, and authorship. To bridge the gap between semantic embeddings and valid social measures, we introduce the Construct Validity Protocol (CVP). Drawing on causal representation learning and psychometrics, the CVP offers a rigorous pipeline from conceptualization to quantitative verification. We further propose Counterfactual Neutralization, a novel method that uses LLMs to reduce confounding in embedding space. By providing a standardized Validity Suite (including tests for discriminant, incremental, and predictive validity), this work offers the community a toolkit for transforming heuristic proxies into robust, scientifically defensible instruments.
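To make the idea of an incremental validity test concrete, the following is a minimal sketch on synthetic data, not the CVP's actual implementation. It assumes that incremental validity can be approximated as the gain in $R^2$ when a proxy score is added to a regression that already contains the confounders $Z$. The function names (`r_squared`, `incremental_r2`), the simulated variables, and the noise levels are all illustrative assumptions.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()


def incremental_r2(proxy, confounders, outcome):
    """Gain in R^2 from adding the proxy to a confounders-only model."""
    base = r_squared(confounders, outcome)
    full = r_squared(np.column_stack([confounders, proxy]), outcome)
    return full - base


# Synthetic data (assumption): the outcome depends on both C and Z.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)  # confounder Z, e.g. a topic score
c = rng.normal(size=n)  # target construct C
outcome = c + z + 0.1 * rng.normal(size=n)

# Hypothetical proxies: one tracks only Z, the other tracks C.
confounded_proxy = z + 0.1 * rng.normal(size=n)
valid_proxy = c + 0.1 * rng.normal(size=n)

gain_confounded = incremental_r2(confounded_proxy, z, outcome)
gain_valid = incremental_r2(valid_proxy, z, outcome)

print(f"R^2 gain, confounded proxy: {gain_confounded:.3f}")
print(f"R^2 gain, valid proxy:      {gain_valid:.3f}")
```

Under these assumptions, a proxy that only re-encodes $Z$ should add almost nothing once $Z$ is controlled for, while a proxy that tracks $C$ should add substantial explanatory power. Real embedding-based measures would need held-out data and a richer set of confounders than this single simulated variable.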