MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer

Qihao Zhao; Yangyu Huang; Wei Hu; Fan Zhang; Jun Liu

The recently proposed data augmentation method TransMix employs attention labels to help Vision Transformers (ViTs) achieve better robustness and performance. However, TransMix is deficient in two respects: 1) its image cropping method may not be suitable for ViTs; 2) in the early stage of training, the model produces unreliable attention maps, and TransMix uses these unreliable maps to compute mixed attention labels, which can adversely affect the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL), which operate in image space and label space, respectively. In image space, MaskMix mixes two images based on a patch-like grid mask. The size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. In label space, PAL uses a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and PAL into a new data augmentation method named MixPro. Experimental results show that our method improves ViT-based models of various scales on ImageNet classification (73.8% top-1 accuracy with DeiT-T trained for 300 epochs). After being pre-trained with MixPro on ImageNet, ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro shows stronger robustness on several benchmarks. The code is available at https://github.com/fistyee/MixPro.
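
The grid-mask mixing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function names, the `scale` and `mix_ratio` parameters, the random per-cell sampling, and the linear form of the label blend are all assumptions made for illustration; the abstract only states that mask patches are a multiple of the image patch size and that a progressive factor re-weights the attention-based label.

```python
import numpy as np


def make_grid_mask(img_size, patch_size, scale, mix_ratio, rng):
    """Return a binary (img_size, img_size) mask: 1 keeps image A, 0 takes image B.

    Each mask cell covers (patch_size * scale) pixels per side, so every
    ViT patch of size patch_size lies entirely inside one cell.
    """
    cell = patch_size * scale
    if img_size % cell != 0:
        raise ValueError("img_size must be divisible by patch_size * scale")
    n = img_size // cell
    # Assumption: each cell is assigned to image A independently with
    # probability mix_ratio.
    cells = (rng.random((n, n)) < mix_ratio).astype(np.float32)
    # Upsample the cell grid to pixel resolution.
    return np.kron(cells, np.ones((cell, cell), dtype=np.float32))


def maskmix(img_a, img_b, mask):
    """Mix two (H, W, C) images with a binary (H, W) mask."""
    m = mask[..., None]
    return m * img_a + (1.0 - m) * img_b


def blend_label_weight(lam_area, lam_attn, alpha):
    """Hypothetical progressive blend of the label weight for image A.

    alpha in [0, 1] plays the role of a progressive factor: alpha = 0 trusts
    only the area-based weight, alpha = 1 only the attention-based weight.
    """
    return (1.0 - alpha) * lam_area + alpha * lam_attn


rng = np.random.default_rng(0)
mask = make_grid_mask(img_size=224, patch_size=16, scale=2, mix_ratio=0.5, rng=rng)
img_a = np.zeros((224, 224, 3), dtype=np.float32)
img_b = np.ones((224, 224, 3), dtype=np.float32)
mixed = maskmix(img_a, img_b, mask)
lam_area = float(mask.mean())  # fraction of pixels taken from image A
```

With `scale=2`, each mask cell spans 32 pixels, so every 16-pixel image patch is drawn from a single source image, which is the property the abstract highlights. In this sketch, a small `alpha` early in training would keep the label close to the area-based weight while attention maps are still unreliable.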

updated: Mon Aug 07 2023 10:20:59 GMT+0000 (UTC)

published: Mon Apr 24 2023 12:38:09 GMT+0000 (UTC)

arXiv
