ZigMa: A DiT-style Zigzag Mamba Diffusion Model

Vincent Tao Hu; Stefan Andreas Baumann; Ming Gui; Olga Grebenkova; Pingchuan Ma; Johannes Schusterbauer; Björn Ommer

Diffusion models have long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. Firstly, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in the scan scheme of Mamba. Secondly, building upon this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Lastly, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets, such as FacesHQ 1024×1024 and UCF101, MultiModal-CelebA-HQ, and MS COCO 256×256. Code will be released at https://taohu.me/zigma/
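To make the notion of spatial continuity in a scan scheme concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes an image has been split into an h × w grid of patch tokens and compares a plain row-major (raster) ordering with a simple row-wise zigzag ordering. The function names `raster_order`, `zigzag_order`, `is_spatially_continuous`, and `to_flat_indices` are hypothetical, and the specific zigzag variants used in ZigMa may differ.

```python
def raster_order(h, w):
    """Row-major scan: every row is read left to right."""
    return [(r, c) for r in range(h) for c in range(w)]


def zigzag_order(h, w):
    """Row-wise zigzag scan: even rows left to right, odd rows right to left."""
    order = []
    for r in range(h):
        cols = range(w) if r % 2 == 0 else range(w - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def is_spatially_continuous(order):
    """True if each consecutive pair of tokens is a 4-neighbour on the grid."""
    return all(
        abs(r1 - r0) + abs(c1 - c0) == 1
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


def to_flat_indices(order, w):
    """Map (row, col) positions to indices into a row-major flattened token list."""
    return [r * w + c for r, c in order]


if __name__ == "__main__":
    h, w = 4, 4
    print("raster continuous:", is_spatially_continuous(raster_order(h, w)))
    print("zigzag continuous:", is_spatially_continuous(zigzag_order(h, w)))
    print("zigzag permutation:", to_flat_indices(zigzag_order(h, w), w))
```

In the raster ordering, the last token of one row is followed by the first token of the next row, which lies at the opposite edge of the grid. The zigzag ordering removes that jump. Because it is only a permutation of token indices, a reordering of this kind adds no learnable parameters.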

updated: Sun Nov 24 2024 14:25:05 GMT+0000 (UTC)

published: Wed Mar 20 2024 17:59:14 GMT+0000 (UTC)

arXiv
