Human value detection from single sentences is a sparse, imbalanced multi-label task. We study whether Schwartz higher-order (HO) categories help this setting on ValueEval'24 / ValuesML (74K English sentences) under a compute-frugal budget. Rather than proposing a new architecture, we compare direct supervised transformers, hard HO$\rightarrow$values pipelines, Presence$\rightarrow$HO$\rightarrow$values cascades, compact instruction-tuned large language models (LLMs), QLoRA, and low-cost upgrades such as threshold tuning and small ensembles. HO categories are learnable: the easiest bipolar pair, Growth vs. Self-Protection, reaches Macro-$F_1=0.58$. The most reliable gains come from calibration and ensembling: threshold tuning improves Social Focus vs. Personal Focus from $0.41$ to $0.57$ ($+0.16$), transformer soft voting lifts Growth from $0.286$ to $0.303$, and a Transformer+LLM hybrid reaches $0.353$ on Self-Protection. In contrast, hard hierarchical gating does not consistently improve the end task. Compact LLMs also underperform supervised encoders as stand-alone systems, although they sometimes add useful diversity in hybrid ensembles. Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.