Dynamic Token Normalization Improves Vision Transformers

Wenqi Shao; Yixiao Ge; Zhaoyang Zhang; Xuyuan Xu; Xiaogang Wang; Ying Shan; Ping Luo

Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various computer vision tasks, owing to their capability to learn long-range contextual information. Layer Normalization (LN) is an essential ingredient in these models. However, we found that ordinary LN makes tokens at different positions similar in magnitude, because it normalizes the embeddings within each token. As a result, it is difficult for Transformers with LN to capture inductive biases such as the positional context in an image. We tackle this problem by proposing a new normalizer, termed Dynamic Token Normalization (DTN), in which normalization is performed both within each token (intra-token) and across different tokens (inter-token). DTN has several merits. First, it is built on a unified formulation and can therefore represent various existing normalization methods. Second, DTN learns to normalize tokens in both intra-token and inter-token manners, enabling Transformers to capture both global contextual information and local positional context. Third, by simply replacing LN layers, DTN can be readily plugged into various vision transformers, such as ViT, Swin, PVT, LeViT, T2T-ViT, BigBird and Reformer. Extensive experiments show that transformers equipped with DTN consistently outperform their baseline models with minimal extra parameters and computational overhead. For example, DTN outperforms LN by 0.5% to 1.2% top-1 accuracy on ImageNet, by 1.2 to 1.4 box AP in object detection on the COCO benchmark, by 2.3% to 3.9% mCE in robustness experiments on ImageNet-C, and by 0.5% to 0.8% accuracy in Long ListOps on Long-Range Arena. Code will be made public at https://github.com/wqshao126/DTN
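
The contrast between intra-token and inter-token normalization can be illustrated with a small NumPy sketch. This is not the authors' DTN formulation, which the abstract does not spell out; it is a simplified illustration that assumes a single hypothetical scalar mixing weight `lam` (in DTN the mixing is learned). The function name `mixed_token_norm` and its parameters are made up for this example.

```python
import numpy as np

def mixed_token_norm(x, lam=0.5, eps=1e-6):
    """Illustrative mix of intra-token and inter-token normalization.

    x   : array of shape (T, C), T tokens with C embedding channels each.
    lam : hypothetical mixing weight in [0, 1] (an assumption of this sketch).
          lam = 1 uses only per-token statistics, as in Layer Normalization.
          lam = 0 uses only per-channel statistics computed across tokens.
    """
    # Intra-token statistics: one mean/variance per token, over channels.
    mu_intra = x.mean(axis=1, keepdims=True)    # shape (T, 1)
    var_intra = x.var(axis=1, keepdims=True)    # shape (T, 1)

    # Inter-token statistics: one mean/variance per channel, over tokens.
    mu_inter = x.mean(axis=0, keepdims=True)    # shape (1, C)
    var_inter = x.var(axis=0, keepdims=True)    # shape (1, C)

    # Convex combination of the two sets of statistics (broadcast to (T, C)).
    mu = lam * mu_intra + (1.0 - lam) * mu_inter
    var = lam * var_intra + (1.0 - lam) * var_inter
    return (x - mu) / np.sqrt(var + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Tokens whose magnitude grows with position (token 0 small, token 7 large).
    scales = np.arange(1, 9, dtype=float)[:, None]
    tokens = rng.normal(size=(8, 16)) * scales

    for lam in (1.0, 0.0):
        out = mixed_token_norm(tokens, lam=lam)
        print(f"lam={lam}: token norms =", np.round(np.linalg.norm(out, axis=1), 2))
```

With `lam = 1` every token is rescaled to roughly the same norm (about the square root of the channel count), which is the magnitude-flattening effect the abstract attributes to LN. With `lam = 0` the relative magnitudes of tokens at different positions survive normalization.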

updated: Fri Oct 14 2022 05:25:34 GMT+0000 (UTC)

published: Sun Dec 05 2021 17:04:59 GMT+0000 (UTC)

arXiv
