MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer

Qihao Zhao; Yangyu Huang; Wei Hu; Fan Zhang; Jun Liu

The recently proposed data augmentation method TransMix employs attention labels to help vision transformers (ViTs) achieve better robustness and performance. However, TransMix is deficient in two respects: 1) Its image cropping method may not be suitable for vision transformers. 2) In the early stages of training, the model produces unreliable attention maps, and TransMix uses these maps to compute mixed attention labels, which can harm the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL), which operate in image space and label space, respectively. In image space, MaskMix mixes two images based on a patch-like grid mask. The size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. In label space, PAL uses a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and PAL into a new data augmentation method named MixPro. Experimental results show that our method improves ViT-based models of various scales on ImageNet classification (73.8% top-1 accuracy with DeiT-T trained for 300 epochs). After being pre-trained with MixPro on ImageNet, ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro shows stronger robustness on several benchmarks. The code will be released at https://github.com/fistyee/MixPro.
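The two ideas in the abstract can be illustrated with a short NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows one way a grid mask aligned to image patches and a re-weighted mixing ratio might look. The function names, the `keep_prob` parameter, and the linear blend between an area-based and an attention-based ratio are all assumptions made for illustration; the abstract does not specify how the progressive factor is computed, so `factor` is simply taken as an input in [0, 1].

```python
import numpy as np

def make_grid_mask(img_size, patch_size, scale, keep_prob, rng):
    """Binary HxW mask made of square cells of side patch_size * scale.

    Because the cell side is a multiple of the image patch size, every
    image patch falls entirely inside one cell (hypothetical helper).
    """
    cell = patch_size * scale
    if img_size % cell != 0:
        raise ValueError("img_size must be divisible by patch_size * scale")
    n = img_size // cell
    cells = (rng.random((n, n)) < keep_prob).astype(np.float32)
    # Expand each grid cell to a cell x cell block of pixels.
    return np.kron(cells, np.ones((cell, cell), dtype=np.float32))

def maskmix(img_a, img_b, mask):
    """Mix two HxWxC images: mask == 1 takes img_a, mask == 0 takes img_b."""
    m = mask[..., None]
    return m * img_a + (1.0 - m) * img_b

def patch_level_mask(mask, patch_size):
    """Reduce the pixel mask to one value per image patch."""
    h, w = mask.shape
    blocks = mask.reshape(h // patch_size, patch_size, w // patch_size, patch_size)
    return blocks.mean(axis=(1, 3))

def blended_lambda(patch_mask, attn, factor):
    """Mixing ratio for img_a's label (assumed linear blend, not the paper's formula).

    patch_mask: per-patch mask (1 where the patch comes from img_a)
    attn:       non-negative per-patch attention weights, same shape
    factor:     assumed progressive factor in [0, 1]; 0 trusts only the
                mask area, 1 trusts only the attention map
    """
    lam_area = patch_mask.mean()
    lam_attn = (attn / attn.sum() * patch_mask).sum()
    return factor * lam_attn + (1.0 - factor) * lam_area
```

With one-hot labels `y_a` and `y_b`, the mixed target would then be `lam * y_a + (1 - lam) * y_b`. Keeping `factor` small early in training is one way to limit the influence of attention maps while they are still unreliable.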

updated: Mon Apr 24 2023 12:38:09 GMT+0000 (UTC)

published: Mon Apr 24 2023 12:38:09 GMT+0000 (UTC)

arXiv
