We study sentence-level detection of the 19 human values in the refined Schwartz continuum on roughly 74k English sentences from news articles and political manifestos (the ValueEval'24 corpus). Each sentence is annotated for value presence, yielding a binary moral-presence label and a 19-way multi-label task under severe class imbalance. First, we show that moral presence is learnable from single sentences: a DeBERTa-base classifier with calibrated thresholds attains positive-class F1 = 0.74. Second, we compare direct multi-label value detectors with presence-gated hierarchies under a strict compute budget, a single consumer-grade GPU with 8 GB of VRAM, choosing all training and inference configurations to fit within it. Presence gating does not improve over direct prediction, indicating that gate recall becomes the bottleneck. Third, we investigate lightweight auxiliary signals (short-range context, LIWC-22, and moral lexica) and small ensembles. Our best supervised configuration, a soft-voting ensemble of DeBERTa-based models enriched with these signals, reaches macro-F1 = 0.332 on the 19 values, improving over the best previous English-only baseline on this corpus, the strongest official ValueEval'24 English run (macro-F1 = 0.28 on the same 19-value test set). Methodologically, our study provides, to our knowledge, the first systematic comparison of direct versus presence-gated architectures, lightweight feature-augmented encoders, and medium-sized instruction-tuned Large Language Models (LLMs) for refined Schwartz values at the sentence level. We additionally benchmark 7-9B instruction-tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, Qwen 2.5 7B) in zero-/few-shot and QLoRA setups and find that they lag behind the supervised ensemble under the same compute budget. Overall, our results provide empirical guidance for building compute-efficient, value-aware NLP models.
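To make the two post-processing steps mentioned above concrete, the sketch below illustrates per-label threshold calibration and soft-voting ensembling for a 19-way multi-label setup. It is a minimal illustration, not the authors' implementation: the function names, the threshold grid, and the random arrays standing in for model outputs are assumptions introduced here.

```python
# Minimal sketch of per-label threshold calibration and soft-voting ensembling
# for a 19-label multi-label task; arrays of random numbers stand in for the
# probability outputs of individual DeBERTa-based models (illustrative only).
import numpy as np
from sklearn.metrics import f1_score


def calibrate_thresholds(probs, labels, grid=np.linspace(0.05, 0.95, 91)):
    """Pick one decision threshold per label by maximizing F1 on validation data.

    probs:  (n_samples, n_labels) predicted probabilities
    labels: (n_samples, n_labels) binary gold labels
    """
    thresholds = np.zeros(probs.shape[1])
    for j in range(probs.shape[1]):
        scores = [f1_score(labels[:, j], probs[:, j] >= t, zero_division=0) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds


def soft_vote(member_probs):
    """Average the probability outputs of several ensemble members (soft voting)."""
    return np.mean(np.stack(member_probs, axis=0), axis=0)


# Illustrative usage with synthetic validation data.
rng = np.random.default_rng(0)
val_labels = rng.integers(0, 2, size=(1000, 19))
member_probs = [rng.random((1000, 19)) for _ in range(3)]  # e.g., three model variants
ensemble_probs = soft_vote(member_probs)
thresholds = calibrate_thresholds(ensemble_probs, val_labels)
preds = (ensemble_probs >= thresholds).astype(int)
print("macro-F1:", f1_score(val_labels, preds, average="macro", zero_division=0))
```

In practice the thresholds would be tuned on a held-out validation split and then applied unchanged to the test set.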