Early Convolutions Help Transformers See Better

Tete Xiao; Mannat Singh; Eric Mintun; Trevor Darrell; Piotr Dollár; Ross Girshick

Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p×p convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3×3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ~1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across a wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models in this regime as a more robust architectural choice compared to the original ViT model design.
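
As a rough illustration of the two stem designs contrasted in the abstract, the sketch below compares the spatial downsampling of a single stride-p p×p patchify convolution with that of a stack of stride-two 3×3 convolutions. It only does the size arithmetic and is not the authors' implementation. The 224-pixel input, the padding of 1, the choice of four stacked layers, and all function names are assumptions made for this example; the abstract says only "a small number" of stacked convolutions.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def patchify_stem_size(size, p=16):
    """Output size of one stride-p, p x p convolution with no padding."""
    return conv_out_size(size, kernel=p, stride=p, padding=0)


def conv_stem_sizes(size, num_layers=4):
    """Sizes after each stacked stride-2, 3 x 3 convolution.

    num_layers=4 and padding=1 are assumptions for illustration only.
    """
    sizes = [size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1], kernel=3, stride=2, padding=1))
    return sizes


IMAGE_SIZE = 224  # assumed input resolution, not stated in the abstract

print(patchify_stem_size(IMAGE_SIZE))  # 14
print(conv_stem_sizes(IMAGE_SIZE))     # [224, 112, 56, 28, 14]
```

Under these assumptions both stems reduce a 224×224 image to a 14×14 grid (196 tokens), since four halvings give the same 16× reduction as p=16. The transformer blocks that follow therefore see the same sequence length in either design; only the early visual processing differs.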

updated: Mon Oct 25 2021 19:54:23 GMT+0000 (UTC)

published: Mon Jun 28 2021 17:59:33 GMT+0000 (UTC)

arXiv
