Vision Transformer: ViT and its Derivatives

Zujun Fu

Transformer, an attention-based encoder-decoder architecture, has not only revolutionized natural language processing (NLP) but has also enabled pioneering work in computer vision (CV). Compared with convolutional neural networks (CNNs), the Vision Transformer (ViT) draws on its strong modeling capacity to achieve very good performance on several benchmarks, including ImageNet, COCO, and ADE20k. ViT is inspired by the self-attention mechanism used in natural language processing, with patch embeddings taking the place of word embeddings. This paper reviews the derivatives of ViT and the cross-applications of ViT with other fields.
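The replacement of word embeddings by patch embeddings can be illustrated with a minimal sketch. It assumes non-overlapping square patches and a single linear projection, and it leaves out the class token, position embeddings, and the attention layers themselves. The function names (`extract_patches`, `embed_patches`), the toy sizes, and the random weights are hypothetical choices for illustration, not taken from the paper.

```python
import random


def extract_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size x C block into a single vector.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


def embed_patches(patches, weights):
    """Project each flattened patch to the model width with a linear map."""
    return [
        [sum(x * w for x, w in zip(patch, column)) for column in weights]
        for patch in patches
    ]


rng = random.Random(0)
# Hypothetical toy sizes: an 8 x 8 RGB image, 4 x 4 patches, embedding width 6.
image = [[[rng.random() for _ in range(3)] for _ in range(8)] for _ in range(8)]
patches = extract_patches(image, patch_size=4)
# One weight vector per output dimension; a trained model would learn these.
weights = [[rng.gauss(0.0, 0.02) for _ in range(4 * 4 * 3)] for _ in range(6)]
tokens = embed_patches(patches, weights)
print(len(tokens), len(tokens[0]))  # 4 patch tokens, each of width 6
```

Each row of `tokens` plays the role that a word embedding plays in an NLP Transformer: the image becomes a short sequence of vectors that self-attention layers can then process.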

updated: Tue May 24 2022 14:08:01 GMT+0000 (UTC)

published: Thu May 12 2022 14:02:39 GMT+0000 (UTC)

arXiv
