Global Vision Transformer Pruning with Hessian-Aware Saliency

Huanrui Yang; Hongxu Yin; Maying Shen; Pavlo Molchanov; Hai Li; Jan Kautz

Transformers yield state-of-the-art results across many tasks. However, their heuristically designed architectures impose huge computational costs during inference. This work aims to challenge the common design philosophy of the Vision Transformer (ViT) model, which uses a uniform dimension across all the stacked blocks in a model stage. We redistribute the parameters both across transformer blocks and between different structures within a block via the first systematic attempt at global structural pruning. To deal with diverse ViT structural components, we derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently. On ImageNet-1K, NViT-Base achieves a 2.6x FLOPs reduction, 5.1x parameter reduction, and 1.9x run-time speedup over the DeiT-Base model in a near-lossless manner. Smaller NViT variants achieve more than 1% accuracy gain at the same throughput as the DeiT Small/Tiny variants, as well as a lossless 3.3x parameter reduction over the SWIN-Small model. These results outperform prior art by a large margin. We further analyze the parameter redistribution of NViT, showing the high prunability of ViT models, distinct sensitivity within a ViT block, and a unique parameter distribution trend across stacked ViT blocks. Our insights show the viability of a simple yet effective parameter redistribution rule towards more efficient ViTs for an off-the-shelf performance boost.
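
To make the idea of global structural pruning with a latency-aware term more concrete, the following is a minimal Python sketch, not the paper's implementation. It makes several assumptions that the abstract does not confirm: the saliency is a first-order Taylor-style score (squared sum of gradient times weight over a group) used as a stand-in for the Hessian-based criterion, the latency term is a simple linear penalty, and the names `group_saliency`, `global_prune`, `latency_saving`, and `latency_weight` are hypothetical. The point it illustrates is only that all structural groups, from every block, are ranked in one global list rather than per layer.

```python
import numpy as np


def group_saliency(weights, grads):
    """Stand-in saliency (assumption): squared sum of grad * weight over one group.

    This is a first-order Taylor-style score, NOT the Hessian-based
    criterion derived in the paper.
    """
    return float(np.sum(np.asarray(weights) * np.asarray(grads)) ** 2)


def global_prune(groups, keep_ratio, latency_weight=0.0):
    """Rank all groups from all blocks jointly and return the names to keep.

    groups: list of dicts with hypothetical keys
        "name", "weights", "grads", "latency_saving".
    latency_weight: strength of the assumed linear latency penalty; a group
        whose removal saves more latency gets a lower score, so it is
        pruned earlier.
    """
    scored = []
    for g in groups:
        score = group_saliency(g["weights"], g["grads"])
        score -= latency_weight * g["latency_saving"]
        scored.append((score, g["name"]))
    # One global ranking across every block and structure type.
    scored.sort(reverse=True)
    n_keep = max(1, int(round(keep_ratio * len(scored))))
    return [name for _, name in scored[:n_keep]]


# Toy groups: two attention heads and two MLP neurons in two blocks.
demo_groups = [
    {"name": "block0.head0", "weights": [1.0, 1.0], "grads": [1.0, 1.0], "latency_saving": 1.0},
    {"name": "block0.mlp0", "weights": [1.0, 0.0], "grads": [0.5, 0.0], "latency_saving": 0.1},
    {"name": "block1.head0", "weights": [2.0, 1.0], "grads": [0.1, 0.1], "latency_saving": 1.0},
    {"name": "block1.mlp0", "weights": [1.0, 1.0], "grads": [1.0, -1.0], "latency_saving": 0.1},
]

print(global_prune(demo_groups, keep_ratio=0.75))
print(global_prune(demo_groups, keep_ratio=0.75, latency_weight=0.2))
```

In this toy setup, turning on the latency penalty changes which low-saliency group survives: the cheap MLP neuron in block 1 is kept in place of the more latency-costly attention head. A real pipeline would prune iteratively and fine-tune between steps, as the abstract describes for DeiT-Base.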

updated: Wed Mar 29 2023 21:00:43 GMT+0000 (UTC)

published: Sun Oct 10 2021 18:04:59 GMT+0000 (UTC)

arXiv
