The five-dollar model is a lightweight text-to-image generative architecture that generates low-dimensional images from an encoded text prompt. It can generate accurate and aesthetically pleasing content in low-dimensional domains with limited amounts of training data. Despite the small size of both the model and the datasets, the generated images maintain the encoded semantic meaning of the textual prompt. We apply this model to three small datasets: pixel art video game maps, video game sprite images, and down-scaled emoji images. We also apply novel augmentation strategies to improve the model's performance on these limited datasets. We evaluate the model's performance using the cosine similarity score between text-image pairs, computed with the CLIP ViT-B/32 model.
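As a rough illustration of the evaluation metric, the following minimal sketch computes the cosine similarity between a text embedding and an image embedding and averages it over a set of pairs. It assumes the embeddings have already been produced by a CLIP encoder. The function names `cosine_similarity` and `mean_pair_similarity` and the toy vectors are hypothetical placeholders, not part of the original work or of any CLIP API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pair_similarity(text_embs, image_embs):
    """Average cosine similarity over aligned (text, image) embedding pairs."""
    if len(text_embs) != len(image_embs) or not text_embs:
        raise ValueError("need the same non-zero number of text and image embeddings")
    scores = [cosine_similarity(t, i) for t, i in zip(text_embs, image_embs)]
    return sum(scores) / len(scores)


# Toy stand-ins for CLIP embeddings (assumption: real ones are much longer vectors).
toy_text_embs = [[3.0, 4.0], [1.0, 0.0]]
toy_image_embs = [[4.0, 3.0], [1.0, 0.0]]
score = mean_pair_similarity(toy_text_embs, toy_image_embs)
```

A higher score means the generated image sits closer to its prompt in the shared embedding space. In practice the two lists would hold the text and image features returned by the CLIP model for each prompt and its generated image.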