In this paper, we test whether Multimodal Large Language Models (MLLMs) can match human-subject performance on tasks involving the perception of properties in network layouts. Specifically, we replicate a human-subject experiment on perceiving quality (namely stress) in network layouts using GPT-4o, Gemini-2.5, and Qwen2.5. Our experiments show that giving the MLLMs the same study information as trained human participants yields performance comparable to that of human experts and exceeding that of untrained non-experts. Additionally, we show that prompt engineering that deviates from the human-subject experiment can lead to better-than-human performance in some settings. Interestingly, like the human subjects, the MLLMs appear to rely on visual proxies rather than computing the actual value of stress, indicating some sense or facsimile of perception. The models' explanations resemble those given by the human participants (e.g., an even distribution of nodes and uniform edge lengths).