PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Xiaoyi Dong; Jianmin Bao; Ting Zhang; Dongdong Chen; Weiming Zhang; Lu Yuan; Dong Chen; Fang Wen; Nenghai Yu

This paper explores a better codebook for BERT pre-training of vision transformers. The recent work BEiT successfully transfers BERT pre-training from NLP to the vision field. It directly adopts a simple discrete VAE (dVAE) as the visual tokenizer, but does not consider the semantic level of the resulting visual tokens. By contrast, the discrete tokens in the NLP field are naturally highly semantic. This difference motivates us to learn a perceptual codebook. Somewhat surprisingly, we find one simple yet effective idea: enforcing perceptual similarity during dVAE training. We demonstrate that the visual tokens generated by the proposed perceptual codebook exhibit better semantic meaning, and that they subsequently help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3 with the same number of pre-training epochs. The method also improves object detection and segmentation on COCO val by +1.3 box AP and +1.0 mask AP, and semantic segmentation on ADE20k by +1.0 mIoU. Equipped with a larger ViT-H backbone, we achieve state-of-the-art performance (88.3% Top-1 accuracy) among methods using only ImageNet-1K data. The code and models will be available at https://github.com/microsoft/PeCo.
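
As a rough illustration of what "enforcing perceptual similarity" during tokenizer training could look like, the sketch below adds a feature-space term to an ordinary pixel reconstruction loss. It is not the authors' implementation: the abstract does not specify the feature network or the loss weighting, so `toy_features` (a fixed random projection with ReLU standing in for a frozen deep network), `perceptual_weight`, and all sizes here are hypothetical choices made for illustration only.

```python
import random


def toy_features(image, weights):
    """Hypothetical stand-in for a frozen deep feature extractor.

    image: flat list of pixel values; weights: one row of coefficients per feature.
    Applies a fixed linear projection followed by ReLU.
    """
    return [max(0.0, sum(w * p for w, p in zip(row, image))) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def perceptual_reconstruction_loss(original, reconstructed, weights, perceptual_weight=1.0):
    """Pixel-space MSE plus a weighted feature-space (perceptual) MSE."""
    pixel_term = mse(original, reconstructed)
    feature_term = mse(toy_features(original, weights),
                       toy_features(reconstructed, weights))
    return pixel_term + perceptual_weight * feature_term


rng = random.Random(0)
num_pixels, num_features = 48, 8  # assumed toy sizes
weights = [[rng.gauss(0.0, 1.0) for _ in range(num_pixels)] for _ in range(num_features)]
original = [rng.random() for _ in range(num_pixels)]
noisy = [p + rng.gauss(0.0, 0.1) for p in original]  # imperfect "reconstruction"

loss_identical = perceptual_reconstruction_loss(original, original, weights)
loss_noisy = perceptual_reconstruction_loss(original, noisy, weights)
print(loss_identical, loss_noisy)
```

In an actual training setup the feature extractor would be a pretrained network and the loss would be minimized over the tokenizer's parameters; the point here is only that the reconstruction is penalized for differing from the input in feature space as well as in pixel space.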

updated: Thu Jan 06 2022 18:59:59 GMT+0000 (UTC)

published: Wed Nov 24 2021 18:59:58 GMT+0000 (UTC)

arXiv
