Position Embedding Needs an Independent Layer Normalization

Runyi Yu; Zhennan Wang; Yinhuai Wang; Kehan Li; Yian Zhao; Jian Zhang; Guoli Song; Jie Chen

Position Embedding (PE) is critical for Vision Transformers (VTs) because the self-attention operation is permutation-invariant. By analyzing the input and output of each encoder layer in VTs using reparameterization and visualization, we find that the default PE joining method (simply adding the PE and patch embedding together) applies the same affine transformation to the token embedding and the PE, which limits the expressiveness of the PE and hence constrains the performance of VTs. To overcome this limitation, we propose a simple, effective, and robust method. Specifically, we provide two independent layer normalizations, one for the token embeddings and one for the PE, in each layer, and add their outputs together as the input of that layer's Multi-Head Self-Attention module. Since the method allows the model to adaptively adjust the information of PE for different layers, we name it Layer-adaptive Position Embedding, abbreviated as LaPE. Extensive experiments demonstrate that LaPE can improve various VTs with different types of PE and make VTs robust to PE types. For example, LaPE improves accuracy by 0.94% for ViT-Lite on Cifar10, 0.98% for CCT on Cifar100, and 1.72% for DeiT on ImageNet-1K, which is remarkable considering the negligible extra parameters, memory, and computational cost it introduces. The code is publicly available at https://github.com/Ingrid725/LaPE.
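The difference between the two joining schemes described in the abstract can be illustrated with a minimal sketch for a single encoder layer. This is not the authors' implementation (see the linked repository for that). The function names `layer_norm`, `default_join`, and `lape_join`, the list-of-lists representation, and the parameter layout are all assumptions made for illustration, and pure Python is used only to keep the sketch dependency-free.

```python
import math


def layer_norm(vec, gamma, beta, eps=1e-5):
    """Normalize one embedding vector to zero mean and unit variance,
    then apply an element-wise affine map (gamma * x_hat + beta)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(vec, gamma, beta)]


def default_join(tokens, pos, gamma, beta):
    """Default scheme (illustrative): add PE to the token embeddings first,
    so one shared layer norm applies the same affine map to both."""
    return [
        layer_norm([t + p for t, p in zip(tok, pe)], gamma, beta)
        for tok, pe in zip(tokens, pos)
    ]


def lape_join(tokens, pos, tok_affine, pos_affine):
    """LaPE-style scheme (illustrative): normalize token embeddings and PE
    with two independent layer norms, then add the results.
    tok_affine and pos_affine are hypothetical (gamma, beta) pairs; in a full
    model each encoder layer would hold its own pair."""
    joined = []
    for tok, pe in zip(tokens, pos):
        tok_n = layer_norm(tok, *tok_affine)
        pe_n = layer_norm(pe, *pos_affine)
        joined.append([a + b for a, b in zip(tok_n, pe_n)])
    return joined


# Toy input: 2 tokens with embedding dimension 4 (values are arbitrary).
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
pos = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
ones, zeros = [1.0] * 4, [0.0] * 4

shared = default_join(tokens, pos, ones, zeros)
# A larger gamma on the PE branch lets this layer weight position more heavily.
adaptive = lape_join(tokens, pos, (ones, zeros), ([2.0] * 4, zeros))
```

In this sketch the result of `lape_join` would be the input to the layer's Multi-Head Self-Attention module. Because the PE branch has its own gamma and beta, a layer can scale position information up or down (down to zero) without touching the token branch, which is the layer-adaptive behaviour the abstract refers to.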

updated: Thu Dec 22 2022 08:27:56 GMT+0000 (UTC)

published: Sat Dec 10 2022 10:38:00 GMT+0000 (UTC)

arXiv
