PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Xiaoyi Dong; Jianmin Bao; Ting Zhang; Dongdong Chen; Weiming Zhang; Lu Yuan; Dong Chen; Fang Wen; Nenghai Yu


This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perceptual judgments. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. Surprisingly, we find one simple yet effective idea: enforcing perceptual similarity during dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that the visual tokens learned this way exhibit better semantic meaning and help pre-training achieve superior transfer performance on various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same number of pre-training epochs. Our approach also yields significant improvements in object detection and segmentation on COCO and in semantic segmentation on ADE20K. Equipped with a larger ViT-H backbone, we achieve state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.
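
To make the idea of "enforcing perceptual similarity during dVAE training" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the perceptual term is a distance between normalized multi-layer features of an image and its reconstruction, added to a pixel-level reconstruction loss. The feature extractor here (`toy_features`, a stack of fixed random linear layers with ReLU) is a hypothetical stand-in for the self-supervised transformer mentioned in the abstract, and the names `perceptual_loss`, `total_loss`, and the weight `lam` are assumptions made for illustration only.

```python
import numpy as np

def toy_features(image, weights):
    """Hypothetical stand-in for a deep feature extractor.

    Returns one feature vector per layer (linear map followed by ReLU).
    """
    h = image.reshape(-1)
    feats = []
    for w in weights:
        h = np.maximum(w @ h, 0.0)
        feats.append(h)
    return feats

def perceptual_loss(x, x_rec, weights, eps=1e-8):
    """Sum over layers of the squared distance between L2-normalized features."""
    loss = 0.0
    for fx, fr in zip(toy_features(x, weights), toy_features(x_rec, weights)):
        fx = fx / (np.linalg.norm(fx) + eps)
        fr = fr / (np.linalg.norm(fr) + eps)
        loss += float(np.sum((fx - fr) ** 2))
    return loss

def total_loss(x, x_rec, weights, lam=1.0):
    """Pixel reconstruction loss plus an assumed weighted perceptual term."""
    pixel = float(np.mean((x - x_rec) ** 2))
    return pixel + lam * perceptual_loss(x, x_rec, weights)

# Toy data: an 8x8 "image" and a noisy "reconstruction" of it.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((32, 64)), rng.standard_normal((16, 32))]
x = rng.random((8, 8))
x_rec = x + 0.1 * rng.standard_normal((8, 8))

print("perceptual loss:", perceptual_loss(x, x_rec, weights))
print("total loss:", total_loss(x, x_rec, weights))
```

In a real setup the feature extractor would be a frozen pre-trained network and the loss would be minimized by gradient descent over the tokenizer's parameters; the sketch only shows how such a combined objective could be computed.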

updated: Wed Dec 07 2022 19:11:20 GMT+0000 (UTC)

published: Wed Nov 24 2021 18:59:58 GMT+0000 (UTC)

arXiv
