We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting - out-of-context sentences from news and political manifestos - features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task ("does any value appear?") and show that it is learnable from single sentences (positive-class F1 $\approx$ 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs - Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B - in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit. We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.