| Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation |
|---|
| Authors: Peize Sun; Yi Jiang; Shoufa Chen; Shilong Zhang; Bingyue Peng; Ping Luo; Zehuan Yuan |
| DOI: 10.48550/arXiv.2406.06525 |
| Abstract: We introduce LlamaGen, a new family of image generation models that applies the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance when scaled properly. We reexamine the design space of image tokenizers, the scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with a downsample ratio of 16, reconstruction quality of 0.94 rFID, and codebook usage of 97% on the ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark, outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high-aesthetics-quality images, demonstrating competitive visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve a 326%-414% speedup. We release all models and code to support the open-source community for visual generation and multimodal foundation models. |
| GitHub(pytorch): https://github.com/foundationvision/llamagen |
This work introduces LlamaGen, a new family of image generation models that applies the "next-token prediction" paradigm of large language models (LLMs) to the visual generation domain. LlamaGen explores whether vanilla autoregressive models such as Llama, without inductive biases tailored to visual signals, can achieve state-of-the-art image generation performance when scaled properly. The study reexamines the design space of image tokenizers, the scalability properties of image generation models, and the quality of the training data.
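To make the paradigm concrete, here is a minimal, framework-free sketch of next-token prediction over a flattened grid of discrete image tokens. The real system uses a Llama-style transformer and a learned tokenizer decoder; `model_logits_fn` below is a hypothetical stand-in for the transformer, and the grid size and vocabulary are illustrative, not the paper's values.

```python
import random

def sample_image_tokens(model_logits_fn, vocab_size, grid=(16, 16), seed=0):
    """Autoregressively sample a flattened grid of image tokens.

    model_logits_fn(prefix) -> list of vocab_size logits; a stand-in for
    a Llama-style transformer conditioned on a class or text prompt.
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(grid[0] * grid[1]):
        logits = model_logits_fn(tokens)
        # softmax over the logits, then sample the next token id
        m = max(logits)
        exps = [2.718281828459045 ** (x - m) for x in logits]
        total = sum(exps)
        r = rng.random() * total
        acc = 0.0
        for tok, e in enumerate(exps):
            acc += e
            if acc >= r:
                tokens.append(tok)
                break
    return tokens  # in the real pipeline, the tokenizer decoder maps these to pixels

# toy uniform "model": every token equally likely
uniform = lambda prefix: [0.0] * 8
out = sample_image_tokens(uniform, vocab_size=8, grid=(2, 2))
```

The point of the sketch is that generation is a plain left-to-right sampling loop over token ids, identical in shape to text generation, which is why LLM tooling transfers directly.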
The LlamaGen model series demonstrates that, at sufficient scale, the autoregressive approach can surpass diffusion models as a scalable image generation solution. With a carefully designed image tokenizer, large-scale models, and high-quality training data, LlamaGen not only achieves a breakthrough in class-conditional image generation but also shows competitive text-conditional generation. The study further highlights the role of LLM serving frameworks in accelerating inference, and releases all models and code to foster the open-source community around visual generation and multimodal foundation models.
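One intuition for why LLM serving frameworks help: with a key/value cache, each decoding step runs only the new query against the cached keys, instead of recomputing attention for the whole prefix. A back-of-the-envelope sketch of the work saved (a rough dot-product count, not the paper's measured speedup; the function name is ours):

```python
def attention_dot_products(n_tokens, kv_cache):
    """Rough count of query-key dot products needed to generate
    n_tokens autoregressively, with and without a KV cache."""
    work = 0
    for t in range(1, n_tokens + 1):
        # with a cache, only the new query attends over t cached keys;
        # without it, all t queries are recomputed against t keys.
        work += t if kv_cache else t * t
    return work

n = 256  # e.g. a 16x16 token grid
cached, uncached = attention_dot_products(n, True), attention_dot_products(n, False)
```

The cached cost grows as n^2 in total while the uncached cost grows as n^3, which is why cache-aware serving stacks matter more as token sequences (i.e., image resolutions) grow.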
Through LlamaGen, the study demonstrates the strong potential of the autoregressive approach for image generation: without relying on vision-specific inductive biases, it reaches state-of-the-art performance through large-scale training on high-quality data. The stated limitations and future directions (such as higher-resolution image generation) suggest that further gains may come from more training data and compute, particularly for improving text-image alignment and fixing the specific failure modes of the current models. Inference-speed optimization is also an important consideration for practical deployment, and LlamaGen makes a concrete contribution there as well.