MST: Masked Self-Supervised Transformer for Visual Representation

Zhaowen Li; Zhiyang Chen; Fan Yang; Wei Li; Yousong Zhu; Chaoyang Zhao; Rui Deng; Liwei Wu; Rui Zhao; Ming Tang; Jinqiao Wang

Transformers have been widely used for self-supervised pre-training in Natural Language Processing (NLP) and have achieved great success. However, they have not been fully explored in visual self-supervised learning. Meanwhile, previous methods consider only high-level features and learn representations from a global perspective, which may fail to transfer to downstream dense prediction tasks that focus on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which explicitly captures the local context of an image while preserving global semantic information. Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the structure crucial for self-supervised learning. More importantly, the masked tokens, together with the remaining tokens, are further recovered by a global image decoder, which preserves the spatial information of the image and is better suited to downstream dense prediction tasks. Experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves 76.9% Top-1 accuracy under linear evaluation with DeiT-S using only 300 epochs of pre-training, which outperforms supervised methods trained for the same number of epochs by 0.4% and its comparable variant DINO by 1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation with only 100 epochs of pre-training.
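The masked token strategy described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that the "crucial structure" corresponds to the patches receiving the highest class-token attention averaged over heads, and the function name `attention_guided_mask` and the parameters `keep_ratio` and `mask_prob` are hypothetical names introduced only for illustration.

```python
import math
import random


def attention_guided_mask(cls_attention, keep_ratio=0.5, mask_prob=0.3, seed=None):
    """Return one boolean per patch token; True means the token is masked.

    cls_attention: nested list of shape [num_heads][num_patches] holding the
        attention weight from the class token to each patch token.
    keep_ratio: fraction of patches with the highest mean attention that are
        never masked (hypothetical parameter, assumed for this sketch).
    mask_prob: probability of masking each remaining low-attention patch
        (hypothetical parameter, assumed for this sketch).
    """
    rng = random.Random(seed)
    num_heads = len(cls_attention)
    num_patches = len(cls_attention[0])

    # Average the class-token attention over all heads.
    mean_attn = [
        sum(head[i] for head in cls_attention) / num_heads
        for i in range(num_patches)
    ]

    # Protect the most-attended patches so the main structure stays visible.
    order = sorted(range(num_patches), key=lambda i: mean_attn[i], reverse=True)
    num_protected = math.ceil(keep_ratio * num_patches)
    protected = set(order[:num_protected])

    # Randomly mask only among the unprotected (low-attention) patches.
    return [
        (i not in protected) and (rng.random() < mask_prob)
        for i in range(num_patches)
    ]


# Two heads, six patches; patches 0, 2 and 4 receive the most attention.
example_attention = [
    [0.30, 0.05, 0.25, 0.05, 0.30, 0.05],
    [0.20, 0.10, 0.30, 0.05, 0.25, 0.10],
]
example_mask = attention_guided_mask(example_attention, keep_ratio=0.5, mask_prob=1.0)
```

With `mask_prob=1.0` every unprotected patch is masked, so `example_mask` marks only patches 1, 3 and 5. In a full pipeline the masked positions would be replaced by a mask token before encoding, and a decoder would then reconstruct the image from all tokens. Those steps are outside this sketch.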

updated: Sun Oct 24 2021 06:59:05 GMT+0000 (UTC)

published: Thu Jun 10 2021 11:05:18 GMT+0000 (UTC)

arXiv
