Human communication is often implicit, conveying tone, identity, and intent beyond literal meaning. While large language models have achieved strong performance on explicit tasks such as summarization and reasoning, their capacity for expressivity, or implicit communication, remains underexplored. We introduce \textbf{ExpressivityBench}, a framework for evaluating the expressivity of LLMs using information-theoretic communication models. Our approach quantifies how well LLM-generated text communicates target properties without mentioning them explicitly, across nine tasks spanning emotion, identity, and tone. To enable scalable and reproducible evaluation, we employ LLM-based graders validated against human judgments. Our results reveal that while models are adept at expressing affective content, they struggle with sociolinguistic signals, lagging behind human baselines. This study provides a necessary step toward evaluating human-like implicit communication, with implications for applications such as education, mental health support, and socially-aware dialogue systems. We provide code and data for our benchmark alongside our paper.