mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

Xiaotong Li; Yixiao Ge; Kun Yi; Zixuan Hu; Ying Shan; Ling-Yu Duan

Image BERT pre-training with masked image modeling (MIM) has become a popular approach to self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task over a visual vocabulary, tokenizing continuous visual signals into discrete vision tokens with a pre-learned dVAE. Although this is a feasible solution, improper discretization hinders further improvement of image pre-training. Since image discretization has no ground-truth answers, we believe that a masked patch should not be assigned a unique token id, even if a better tokenizer can be obtained. In this work, we introduce an improved BERT-style image pre-training method, mc-BEiT, which performs the MIM proxy task with eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is formed by soft probability vectors over the discrete token ids, which are predicted by an off-the-shelf image tokenizer and further refined by high-level inter-patch perceptions, based on the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method: the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K classification, 49.2% AP^b and 44.0% AP^m for object detection and instance segmentation on COCO, and 50.8% mIoU on ADE20K semantic segmentation, outperforming competitive counterparts. The code will be available at https://github.com/lixiaotong97/mc-BEiT.
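The multi-choice supervision described in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration: the temperature `tau`, the blending weight `omega`, the use of cosine similarity between patch features, and the softmax weighting of those similarities. It is written in dependency-free Python for readability and should not be read as the authors' implementation.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def multi_choice_targets(tokenizer_logits, patch_features, tau=1.0, omega=0.8):
    """Build soft, similarity-refined targets for each patch.

    tokenizer_logits: per-patch scores over the K token ids (N x K).
    patch_features:   per-patch feature vectors (N x D).
    tau, omega:       hypothetical hyperparameters (assumptions).
    """
    # Step 1: soft probability vector over token ids instead of one hard id.
    soft = [softmax(row, tau) for row in tokenizer_logits]
    n, k = len(soft), len(soft[0])

    refined = []
    for i in range(n):
        # Step 2: weights that sum to 1, larger for patches similar to patch i.
        weights = softmax(
            [cosine(patch_features[i], patch_features[j]) for j in range(n)]
        )
        # Step 3: similar patches share their choices.
        shared = [sum(weights[j] * soft[j][c] for j in range(n)) for c in range(k)]
        # Step 4: blend the patch's own choices with the shared ones.
        refined.append(
            [omega * soft[i][c] + (1.0 - omega) * shared[c] for c in range(k)]
        )
    return refined


def soft_cross_entropy(pred_logits, targets):
    """Mean cross-entropy between predicted logits and soft targets."""
    total = 0.0
    for logits, target in zip(pred_logits, targets):
        log_probs = log_softmax(logits)
        total -= sum(t * lp for t, lp in zip(target, log_probs))
    return total / len(targets)


if __name__ == "__main__":
    logits = [[2.0, 1.0, 0.0], [1.9, 1.1, 0.0], [0.0, 0.5, 3.0]]
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    targets = multi_choice_targets(logits, feats)
    print(soft_cross_entropy([[0.0, 0.0, 0.0]] * 3, targets))
```

Each refined target is a convex combination of probability vectors, so it remains a valid distribution over token ids. Setting `omega=1.0` in this sketch recovers the unrefined soft tokenizer output.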

updated: Thu Jul 28 2022 03:56:05 GMT+0000 (UTC)

published: Tue Mar 29 2022 09:08:18 GMT+0000 (UTC)

arXiv
