Autoregressive Pretraining with Mamba in Vision

Sucheng Ren; Xianhang Li; Haoqin Tu; Feng Wang; Fangxun Shu; Lei Zhang; Jieru Mei; Linjie Yang; Peng Wang; Heng Wang; Alan Yuille; Cihang Xie

The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on the Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies like mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384×384 inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.

updated: Tue Jun 11 2024 17:58:34 GMT+0000 (UTC)

published: Tue Jun 11 2024 17:58:34 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト