We study sentence-level detection of the 19 human values of the refined Schwartz continuum in about 74k English sentences from news articles and political manifestos (the ValueEval'24 corpus). Each sentence is annotated for value presence, yielding a binary moral-presence label and a 19-way multi-label task under severe class imbalance. First, we show that moral presence is learnable from single sentences: a DeBERTa-base classifier attains a positive-class F1 of 0.74 with calibrated thresholds. Second, we compare direct multi-label value detectors with presence-gated hierarchies under a single 8 GB GPU budget. Under matched compute, presence gating does not improve over direct prediction, indicating that the recall of the presence gate becomes a bottleneck. Third, we investigate lightweight auxiliary signals (short-range context, LIWC-22 and moral lexica, and topic features) and small ensembles. Our best supervised configuration, a soft-voting ensemble of DeBERTa-based models enriched with these signals, reaches macro-F1 = 0.332 on the 19 values, improving over the best previous English-only baseline on this corpus (macro-F1 $\approx$ 0.28). We additionally benchmark 7-9B instruction-tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, Qwen 2.5 7B) in zero-/few-shot and QLoRA setups and find that they lag behind the supervised ensemble under the same hardware constraint. Overall, our results provide empirical guidance for building compute-efficient, value-aware NLP models under realistic GPU budgets.
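The threshold calibration and soft-voting aggregation mentioned above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (an F1-maximizing grid search per label and a simple probability average); it is not the paper's exact implementation.

```python
import numpy as np

def calibrate_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick a per-label decision threshold that maximizes F1 on held-out data.

    probs:  (n_samples, n_labels) predicted probabilities
    labels: (n_samples, n_labels) binary ground-truth annotations
    """
    n_labels = probs.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        best_f1 = -1.0
        for t in grid:
            pred = probs[:, j] >= t
            tp = np.sum(pred & (labels[:, j] == 1))
            fp = np.sum(pred & (labels[:, j] == 0))
            fn = np.sum(~pred & (labels[:, j] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:
                best_f1, thresholds[j] = f1, t
    return thresholds

def soft_vote(prob_list):
    """Average probability outputs of several models (soft voting)."""
    return np.mean(prob_list, axis=0)
```

In a presence-gated hierarchy, the binary moral-presence classifier would first filter sentences and only the retained ones would be passed to the 19-way detector; the direct setup instead applies the calibrated per-label thresholds to every sentence.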