Sliced Recursive Transformer

Zhiqiang Shen; Zechun Liu; Eric Xing


We present a neat yet effective recursive operation for vision transformers that improves parameter utilization without adding parameters. This is achieved by sharing weights across the depth of the transformer network. The proposed method obtains a substantial gain (~2%) simply by using a naive recursive operation, requires no special or sophisticated knowledge of network design principles, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by the recursive operation while maintaining the superior accuracy, we propose an approximation method based on multiple sliced group self-attentions across recursive layers, which reduces the computational cost by 10~30% with minimal performance loss. We call our model the Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is compatible with a broad range of other designs for efficient ViT architectures. Our best model achieves a significant improvement on ImageNet-1K over state-of-the-art methods while containing fewer parameters. The proposed weight-sharing mechanism, built on the sliced recursion structure, allows us to easily build a transformer with more than 100 or even 1000 shared layers while keeping a compact size (13~15M parameters), which avoids the optimization difficulties that arise when a model is too large. This flexible scalability shows great potential for scaling up models and constructing extremely deep vision transformers. Code is available at https://github.com/szq0214/SReT.
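The two ideas in the abstract, reusing one set of block weights several times and restricting self-attention to slices of the token sequence, can be illustrated with a rough cost model. The sketch below is not taken from the SReT code base: the function names, the 4x MLP ratio, and the choice to ignore biases, normalization layers, and projection FLOPs are all assumptions made for illustration.

```python
def block_params(dim: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one transformer block.

    Assumption: biases and normalization parameters are ignored.
    """
    attn = 4 * dim * dim                # Q, K, V and output projections
    mlp = 2 * dim * (mlp_ratio * dim)   # two linear layers of the MLP
    return attn + mlp


def model_params(dim: int, unique_blocks: int) -> int:
    """Parameters depend only on the number of unique (unshared) blocks."""
    return unique_blocks * block_params(dim)


def effective_depth(unique_blocks: int, loops: int) -> int:
    """Each unique block is applied `loops` times with the same weights."""
    return unique_blocks * loops


def attention_matmul_cost(tokens: int, dim: int, groups: int = 1) -> int:
    """Multiply-adds for QK^T and attention-times-V only.

    Assumption: tokens are split into `groups` equal slices and attention
    is computed independently inside each slice.
    """
    if tokens % groups != 0:
        raise ValueError("tokens must be divisible by groups")
    per_group = tokens // groups
    return groups * 2 * per_group * per_group * dim


if __name__ == "__main__":
    dim, unique_blocks = 384, 12  # hypothetical sizes, not the paper's
    for loops in (1, 2, 4):
        print(effective_depth(unique_blocks, loops),
              model_params(dim, unique_blocks))
    for groups in (1, 2, 4):
        print(groups, attention_matmul_cost(196, dim, groups))
```

Under these assumptions, looping increases the effective depth while the parameter count stays fixed, and splitting the tokens into G groups divides the attention matrix products by G. The linear projections and the MLP are unaffected by slicing, so the saving for a whole network is smaller than 1/G; this sketch does not attempt to reproduce the 10~30% figure reported in the abstract.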

updated: Sat Jul 23 2022 14:34:14 GMT+0000 (UTC)

published: Tue Nov 09 2021 17:59:14 GMT+0000 (UTC)

arXiv
