Aggregating Nested Transformers

Zizhao Zhang; Han Zhang; Long Zhao; Ting Chen; Tomas Pfister

Although hierarchical structures are popular in recent vision transformers, they require sophisticated designs and massive datasets to work well. In this work, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical manner. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires only minor code changes to the original vision transformer and obtains improved performance compared to existing methods. Our empirical results show that the proposed method, NesT, converges faster and requires much less training data to achieve good generalization. For example, a NesT with 68M parameters trained on ImageNet for 100/300 epochs achieves 82.3%/83.8% accuracy evaluated on 224×224 image size, outperforming previous methods with up to 57% parameter reduction. Training a NesT with 6M parameters from scratch on CIFAR10 achieves 96% accuracy using a single GPU, setting a new state of the art for vision transformers. Beyond image classification, we extend the key idea to image generation and show that NesT leads to a strong decoder that is 8× faster than previous transformer-based generators. Furthermore, we also propose a novel method for visually interpreting the learned model. Source code is available at https://github.com/google-research/nested-transformer.
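
The nesting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `blockify`, `unblockify`, and `aggregate` are hypothetical, the local transformer applied inside each block is omitted, and a plain 2×2 max pool stands in for the block aggregation function, which the abstract does not specify. The sketch only shows how a token grid is split into non-overlapping blocks and how pooling on the full image plane merges four neighbouring blocks into one at the next level of the hierarchy.

```python
def blockify(grid, block):
    """Split an H x W grid (list of rows) into non-overlapping block x block tiles.

    Returns tiles in row-major order; each tile is a flat list of tokens.
    """
    h, w = len(grid), len(grid[0])
    assert h % block == 0 and w % block == 0, "grid must divide evenly into blocks"
    return [
        [grid[y][x] for y in range(by, by + block) for x in range(bx, bx + block)]
        for by in range(0, h, block)
        for bx in range(0, w, block)
    ]


def unblockify(tiles, block, h, w):
    """Inverse of blockify: place each tile back at its position in the H x W grid."""
    grid = [[None] * w for _ in range(h)]
    tiles_per_row = w // block
    for i, tile in enumerate(tiles):
        by, bx = divmod(i, tiles_per_row)
        for j, value in enumerate(tile):
            dy, dx = divmod(j, block)
            grid[by * block + dy][bx * block + dx] = value
    return grid


def aggregate(tiles, block, h, w):
    """Stand-in aggregation (an assumption): 2x2 max pool on the full grid.

    Pooling halves each spatial side, so re-splitting with the same block
    size yields a quarter as many blocks, each covering four former blocks.
    """
    grid = unblockify(tiles, block, h, w)
    pooled = [
        [
            max(grid[y][x], grid[y][x + 1], grid[y + 1][x], grid[y + 1][x + 1])
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]
    return blockify(pooled, block), h // 2, w // 2


# A 4 x 4 grid of scalar "tokens", split into four 2 x 2 blocks.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
tiles = blockify(grid, block=2)
merged, new_h, new_w = aggregate(tiles, block=2, h=4, w=4)
```

In this toy case the four input blocks collapse into a single block holding one value from each of them, so any per-block processing at the next level sees information from all four regions. That is the cross-block communication the abstract attributes to the aggregation step.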

updated: Sat Jun 19 2021 02:36:02 GMT+0000 (UTC)

published: Wed May 26 2021 17:56:48 GMT+0000 (UTC)

arXiv
