Grafting Vision Transformers

Jongwoo Park; Kumara Kahatapitiya; Donghyun Kim; Shivchander Sudalairaj; Quanfu Fan; Michael S. Ryoo

Vision Transformers (ViTs) have recently become the state of the art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within the shallow layers of a network, i.e., among high-resolution features. However, this advantage was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features. GrafT can be easily adopted in both homogeneous and pyramid Transformers and shows consistent gains. It can branch out at arbitrary depths, widening a network with multiple scales. This grafting operation lets us share most of the parameters and computations of the backbone, adding only minimal complexity while giving a higher yield. In fact, the process of progressively compounding multi-scale receptive fields in GrafT enables communication between local regions. We show the benefits of the proposed method on multiple benchmarks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection and instance segmentation (COCO2017). Our code and models will be made available.
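
The performance-complexity trade-off mentioned above can be made concrete with a small counting exercise. The sketch below is not the authors' cost model and does not implement GrafT. It only counts query-key pairs for global self-attention over a square token grid at several downsampling factors. The image size (224), patch size (4), and downsampling factors are illustrative assumptions, not values taken from the abstract, and the function names are hypothetical.

```python
def token_grid(image_size: int, patch_size: int, downsample: int = 1) -> int:
    """Side length of the square token grid after patching and pooling.

    All three arguments are illustrative assumptions, not paper settings.
    """
    side = image_size // patch_size
    return side // downsample


def global_attention_pairs(side: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    tokens = side * side
    return tokens * tokens


def scale_report(image_size=224, patch_size=4, factors=(1, 2, 4, 8)):
    """Return (factor, token count, attention pairs) for each scale."""
    rows = []
    for factor in factors:
        side = token_grid(image_size, patch_size, factor)
        rows.append((factor, side * side, global_attention_pairs(side)))
    return rows


if __name__ == "__main__":
    for factor, tokens, pairs in scale_report():
        print(f"downsample x{factor}: {tokens} tokens, {pairs} pairs")
```

Under these assumptions, halving the grid side cuts the token count by 4 and the number of attention pairs by 16, which is one way to see why global attention on coarser scales is far cheaper than on the full-resolution grid.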

updated: Fri Oct 28 2022 07:07:13 GMT+0000 (UTC)

published: Fri Oct 28 2022 07:07:13 GMT+0000 (UTC)

arXiv
