Personal values are a crucial factor behind human decision-making. Because Large Language Models (LLMs) have been shown to influence human decisions significantly, ensuring that they accurately understand human values is essential to their safety. However, evaluating their grasp of these values is difficult because values are intricate and adaptable. We argue that truly understanding values requires an LLM to both "know what" and "know why". To this end, we present a comprehensive evaluation metric, ValueDCG (Value Discriminator-Critique Gap), together with an engineering implementation, to quantitatively assess these two aspects. We assess four representative LLMs and provide compelling evidence that the growth rates of LLMs' "know what" and "know why" capabilities do not align with increases in parameter count, so that the models' capacity to understand human values declines as the number of parameters grows. This further suggests that LLMs may craft plausible explanations from the provided context without truly understanding the inherent values, indicating potential risks.