DiT: Efficient Vision Transformers with Dynamic Token Routing

Yuchen Ma; Zhengcong Fei; Junshi Huang


In many recent dense networks, all image tokens share the same static data flow. However, the variance among objects in images poses challenges, such as large variations in spatial scale and differing difficulty of recognizing visual entities. In this paper, we propose a data-dependent token routing strategy that determines the routing paths of image tokens for a Dynamic Vision Transformer, dubbed DiT. The proposed framework generates a data-dependent path for each token, adapting to the object scales and visual discrimination of tokens. In the forward pass, differentiable routing gates select the scaling paths and feature transformation paths for image tokens, leading to multi-path feature propagation. In this way, the impact of object scale and visual discrimination on the image representation can be carefully tuned. Moreover, the computational cost can be further reduced by imposing budget constraints on the routing gates and by early stopping of feature extraction. In experiments, DiT achieves better performance and more favorable complexity/accuracy trade-offs than many SoTA methods on ImageNet classification, object detection, instance segmentation, and semantic segmentation. In particular, DiT-B5 obtains 84.8% top-1 accuracy on ImageNet with 10.3 GFLOPs, which is 1.0% higher than that of the SoTA method with similar computational complexity. These extensive results demonstrate that DiT can serve as a versatile backbone for various vision tasks.
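The abstract describes differentiable routing gates that pick a path per token, plus a budget constraint on those gates. As a rough illustration of that general idea only, the NumPy sketch below mixes the outputs of two candidate paths using per-token softmax gate weights and computes a penalty when the expected path cost exceeds a budget. Everything here is an assumption made for illustration and is not taken from the paper: the function names `route_tokens` and `budget_penalty`, the softmax gate, the two toy paths (identity and a small linear layer with ReLU), the cost values, and the squared-hinge form of the penalty.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_tokens(tokens, w_gate, paths):
    """Soft per-token routing (hypothetical): weight each path's output
    by that token's gate probability and sum over paths."""
    probs = softmax(tokens @ w_gate)                      # (N, P)
    outs = np.stack([p(tokens) for p in paths], axis=1)   # (N, P, D)
    mixed = (probs[:, :, None] * outs).sum(axis=1)        # (N, D)
    return mixed, probs

def budget_penalty(probs, path_costs, budget):
    """Squared hinge on the mean expected path cost (assumed form)."""
    expected = float((probs @ path_costs).mean())
    return max(0.0, expected - budget) ** 2

rng = np.random.default_rng(0)
N, D = 6, 8                                  # tokens, feature dimension
tokens = rng.normal(size=(N, D))
w_transform = rng.normal(size=(D, D)) * 0.1

# Two toy paths: a free skip path and a costlier transform path.
paths = [
    lambda x: x,
    lambda x: np.maximum(x @ w_transform, 0.0),
]
path_costs = np.array([0.0, 1.0])            # illustrative relative costs
w_gate = rng.normal(size=(D, len(paths)))    # gate projection (assumed)

mixed, probs = route_tokens(tokens, w_gate, paths)
penalty = budget_penalty(probs, path_costs, budget=0.5)
```

Because the gate weights come from a softmax, the mixture is differentiable with respect to `w_gate`, which is what would allow such a gate to be trained jointly with the paths; adding `penalty` to a training loss would push the gates toward cheaper paths.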

updated: Fri Aug 11 2023 13:53:19 GMT+0000 (UTC)

published: Mon Aug 07 2023 08:55:48 GMT+0000 (UTC)

arXiv

