GenZ: Foundational models as latent variable generators within traditional statistical models

We present GenZ, a hybrid model that bridges foundational models and statistical modeling through interpretable semantic features. While large language models possess broad domain knowledge, they often fail to capture dataset-specific patterns critical for prediction tasks. Our approach addresses this by discovering semantic feature descriptions through an iterative process that contrasts groups of items identified via statistical modeling errors, rather than relying solely on the foundational model's domain understanding. We formulate this as a generalized EM algorithm that jointly optimizes semantic feature descriptors and statistical model parameters. The method prompts a frozen foundational model to classify items based on the discovered features, and treats these judgments as noisy observations of latent binary features that predict real-valued targets through learned statistical relationships. We demonstrate the approach on two domains: house price prediction (hedonic regression) and cold-start collaborative filtering for movie recommendations. On house prices, our model achieves 12% median relative error using semantic features discovered from multimodal listing data, substantially outperforming a GPT-5 baseline (38% error) that relies on the LLM's general domain knowledge. For Netflix movie embeddings, our model predicts collaborative filtering representations with 0.59 cosine similarity purely from semantic descriptions, matching the performance that traditional collaborative filtering would need approximately 4000 user ratings to reach. The discovered features reveal dataset-specific patterns (e.g., architectural details predicting local housing markets, franchise membership predicting user preferences) that diverge from what the foundational model's domain knowledge alone would suggest.
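
The latent-variable view described above can be illustrated with a deliberately simplified sketch. It is not the GenZ implementation: it assumes a single binary feature, a fixed and known flip rate `eps` for the foundational model's judgments, and a linear-Gaussian link from the feature to the target. It also leaves out the step that rewrites the feature description by prompting the foundational model. The names `e_step`, `m_step`, `fit`, and `eps` are hypothetical and chosen for illustration only; the data is synthetic.

```python
import numpy as np


def e_step(y, c, w0, w1, sigma, eps, prior):
    """Posterior P(z_i = 1 | c_i, y_i) for a single latent binary feature."""

    def log_joint(z):
        # log P(z) + log P(c | z) + log N(y | w0 + w1 * z, sigma^2), dropping
        # the Gaussian normalizer, which is identical for z = 0 and z = 1.
        log_prior = np.log(prior if z == 1 else 1.0 - prior)
        log_obs = np.where(c == z, np.log(1.0 - eps), np.log(eps))
        log_lik = -0.5 * ((y - (w0 + w1 * z)) / sigma) ** 2
        return log_prior + log_obs + log_lik

    # Clip the log-odds so the exponential below cannot overflow.
    log_odds = np.clip(log_joint(1) - log_joint(0), -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-log_odds))


def m_step(y, q):
    """Closed-form maximizer of the expected complete-data log-likelihood."""
    mu1 = np.sum(q * y) / np.sum(q)                # mean target when z = 1
    mu0 = np.sum((1.0 - q) * y) / np.sum(1.0 - q)  # mean target when z = 0
    var = np.mean(q * (y - mu1) ** 2 + (1.0 - q) * (y - mu0) ** 2)
    return mu0, mu1 - mu0, np.sqrt(var), np.mean(q)


def fit(y, c, eps=0.2, n_iter=50):
    """Alternate E and M steps; eps is assumed known and held fixed."""
    # Initialize from the raw judgments, as if they were noise-free.
    w0 = y[c == 0].mean()
    w1 = y[c == 1].mean() - w0
    sigma, prior = y.std(), c.mean()
    for _ in range(n_iter):
        q = e_step(y, c, w0, w1, sigma, eps, prior)
        w0, w1, sigma, prior = m_step(y, q)
    return w0, w1, sigma, q


# Synthetic data: true feature z, noisy judgments c (20% flipped), targets y.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500)
flip = rng.random(500) < 0.2
c = np.where(flip, 1 - z, z)
y = 1.0 + 2.0 * z + 0.3 * rng.standard_normal(500)

w0_hat, w1_hat, sigma_hat, q = fit(y, c)
print(w0_hat, w1_hat, sigma_hat)
```

In this toy setting the target helps correct the flipped judgments: an item labeled 0 whose target sits near the mean of the z = 1 group receives a posterior close to 1. The abstract's generalized EM additionally updates the feature descriptors themselves, which this sketch does not attempt.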
