Localformer: a Locality-Preserving Vision Transformer

Qingsong Zhao; Zhipeng Zhou; Yi Wang; Yu Qiao; Cairong Zhao

Zigzag flattening (ZF) is commonly used in computer vision as the default option for unfolding matrices, e.g., in patch slicing for the Vision Transformer (ViT). However, when decomposing web images that contain objects at multiple scales, ZF does not preserve the smoothness of local information well. To address this, we draw inspiration from space-filling curves (SFCs) and investigate Hilbert flattening (HF) as an alternative for visual models. We provide a comprehensive theoretical discussion and practical analysis, demonstrating the superiority of HF over other SFCs in locality and multi-scale robustness. We use HF to alleviate the lack of locality bias in the shallow layers of ViT, which yields our Localformer. Extensive experiments demonstrate that Localformer consistently improves performance on several common visual tasks. Additionally, upon inspection, we find that Localformer enhances the representation learning and length extrapolation abilities of ViT.
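To make the locality argument in the abstract concrete, the following is a minimal Python sketch that contrasts the two unfolding orders on a small square grid. It is not the authors' implementation. It assumes that ZF corresponds to the usual row-by-row (raster) order used when slicing patches, and it uses the textbook index-to-coordinate construction of the Hilbert curve. The function names (`hilbert_d2xy`, `hilbert_order`, `raster_order`, `hilbert_flatten`, `max_step`) are hypothetical and chosen only for illustration, and the grid side is assumed to be a power of two.

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves connect end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Visit order of an n x n grid along the Hilbert curve (n a power of two)."""
    order = n.bit_length() - 1
    if n < 1 or (1 << order) != n:
        raise ValueError("grid side must be a power of two")
    return [hilbert_d2xy(order, d) for d in range(n * n)]


def raster_order(n):
    """Row-by-row visit order, assumed here to be what the abstract calls ZF."""
    return [(x, y) for y in range(n) for x in range(n)]


def hilbert_flatten(grid):
    """Unfold a square list-of-lists grid into a 1-D list in Hilbert order."""
    return [grid[y][x] for x, y in hilbert_order(len(grid))]


def max_step(coords):
    """Largest Manhattan distance between consecutive cells in a visit order."""
    return max(
        abs(x1 - x0) + abs(y1 - y0)
        for (x0, y0), (x1, y1) in zip(coords, coords[1:])
    )


if __name__ == "__main__":
    n = 8
    print("Hilbert max step:", max_step(hilbert_order(n)))
    print("Raster  max step:", max_step(raster_order(n)))
```

Under these assumptions, consecutive elements in Hilbert order are always spatial neighbours, whereas the raster order jumps across the whole grid at the end of each row. This is one simple way to see the locality property the abstract refers to. The sketch does not model multi-scale robustness or how Localformer integrates HF into ViT.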

updated: Sun Apr 23 2023 11:04:22 GMT+0000 (UTC)

published: Mon Feb 21 2022 13:53:04 GMT+0000 (UTC)

arXiv
