Accelerating Vision Transformer Training via a Patch Sampling Schedule

Bradley McDanel; Chi Phuong Huynh

We introduce the notion of a Patch Sampling Schedule (PSS), which varies the number of Vision Transformer (ViT) patches used per batch during training. Since not all patches are equally important for most vision objectives (e.g., classification), we argue that less important patches can be used in fewer training iterations, leading to shorter training time with minimal impact on performance. Additionally, we observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference. This allows for a fine-grained, dynamic trade-off between throughput and accuracy during inference. We evaluate PSSs on ViTs for ImageNet, both trained from scratch and pre-trained using a reconstruction loss function. For the pre-trained model, we reduce training time by 31% (from 25 to 17 hours) at the cost of a 0.26% reduction in classification accuracy, compared to using all patches each iteration. Code, model checkpoints, and logs are available at https://github.com/BradMcDanel/pss.
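
To make the idea of a schedule over patch counts concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the shape of the schedule or how patches are selected, so everything here is an assumption for illustration: the function names `patches_to_keep` and `sample_patch_indices`, the linear ramp from 50% to 100% of patches, the uniform random selection (the paper argues in terms of patch importance, which this sketch does not model), and the 196-patch grid (a 224x224 image split into 16x16 patches).

```python
import random


def patches_to_keep(step, total_steps, num_patches, min_ratio=0.5, max_ratio=1.0):
    """Hypothetical linear schedule: the kept fraction rises from min_ratio
    at the first step to max_ratio at the last step (an assumption; the
    abstract does not describe the schedule's shape)."""
    frac = step / max(total_steps - 1, 1)
    ratio = min_ratio + (max_ratio - min_ratio) * frac
    return max(1, round(ratio * num_patches))


def sample_patch_indices(num_patches, keep, rng):
    """Pick `keep` distinct patch indices uniformly at random, sorted.
    Uniform sampling is a stand-in for any importance-based selection."""
    return sorted(rng.sample(range(num_patches), keep))


rng = random.Random(0)
num_patches = 196  # assumed: 14 x 14 grid of 16x16 patches on a 224x224 image

for step in (0, 5, 9):
    keep = patches_to_keep(step, total_steps=10, num_patches=num_patches)
    idx = sample_patch_indices(num_patches, keep, rng)
    # Only the selected patch tokens would be passed to the ViT for this batch.
    print(f"step {step}: keeping {keep} of {num_patches} patches")
```

Because self-attention cost grows with the number of tokens, batches that keep fewer patches are cheaper to process, which is the source of the training-time savings the abstract reports. The same index-selection step could be applied at inference with a chosen keep count to trade accuracy for throughput.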

updated: Fri Aug 19 2022 19:16:46 GMT+0000 (UTC)

published: Fri Aug 19 2022 19:16:46 GMT+0000 (UTC)

arXiv
