What should HCI scholars consider when reporting and reviewing papers that involve LLM-integrated systems? We interview 18 authors of LLM-integrated system papers on their authoring and reviewing experiences. We find that norms of trust-building between authors and reviewers appear to be eroded by the uncertainty of LLM behavior and hyperbolic rhetoric surrounding AI. Authors perceive that reviewers apply uniquely skeptical and inconsistent standards towards papers that report LLM-integrated systems, and mitigate mistrust by adding technical evaluations, justifying usage, and de-emphasizing LLM presence. Authors' views challenge blanket directives to report all prompts and use open models, arguing that prompt reporting is context-dependent and justifying proprietary model usage despite ethical concerns. Finally, some tensions in peer review appear to stem from clashes between the norms and values of HCI and ML/NLP communities, particularly around what constitutes a contribution and an appropriate level of technical rigor. Based on our findings and additional feedback from six expert HCI researchers, we present a set of guidelines and considerations for authors, reviewers, and HCI communities around reporting and reviewing papers that involve LLM-integrated systems.