Patch Is Not All You Need

Changzhen Li; Jie Zhang; Yang Wei; Zhilong Ji; Jinfeng Bai; Shiguang Shan

Vision Transformers have achieved great success in computer vision, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input forces images to be manually partitioned into patch sequences, which disrupts the image's inherent structural and semantic continuity. To address this, we propose a novel Pattern Transformer (Patternformer) that adaptively converts images into pattern sequences for Transformer input. Specifically, we employ a Convolutional Neural Network to extract various patterns from the input image, with each channel representing a unique pattern that is fed into the succeeding Transformer as a visual token. By enabling the network to optimize these patterns, each pattern concentrates on its local region of interest, thereby preserving its intrinsic structural and semantic information. Using only a vanilla ResNet and Transformer, we achieve state-of-the-art performance on CIFAR-10 and CIFAR-100 and competitive results on ImageNet.
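To make the contrast in the abstract concrete, the sketch below compares the two ways of turning an image-like array into a token sequence: cutting it into fixed patches, versus treating each channel of a feature map as one token. It is an illustration only, not the authors' implementation. The function names `patch_tokens` and `pattern_tokens` are hypothetical, the CNN that would produce the feature map is omitted (a toy array stands in for its output), and any projection or positional handling the paper may apply before the Transformer is not modeled.

```python
from typing import List

# A toy tensor indexed as [channel][row][col]. For pattern tokens it stands in
# for a CNN feature map; the CNN itself is omitted (assumption of this sketch).
Tensor3D = List[List[List[float]]]


def patch_tokens(image: Tensor3D, patch: int) -> List[List[float]]:
    """Patch-style tokenization: one token per non-overlapping patch x patch tile."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    assert height % patch == 0 and width % patch == 0, "patch must divide H and W"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten every channel of this tile into a single token vector.
            tokens.append([
                image[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens


def pattern_tokens(feature_map: Tensor3D) -> List[List[float]]:
    """Channel-wise tokenization: one token per channel, flattened over all positions."""
    return [[value for row in channel for value in row] for channel in feature_map]


# Toy input: 3 channels, 4 x 4 spatial grid.
demo = [[[float(c * 100 + r * 4 + col) for col in range(4)] for r in range(4)]
        for c in range(3)]

patches = patch_tokens(demo, patch=2)
patterns = pattern_tokens(demo)
print(len(patches), len(patches[0]))    # 4 tokens, each of length 12
print(len(patterns), len(patterns[0]))  # 3 tokens, each of length 16
```

In the patch view, the number of tokens is fixed by the grid and each token sees only one tile. In the channel view, the number of tokens equals the number of channels and each token spans the whole spatial extent, so a learned channel is free to respond to whichever region it finds relevant.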

updated: Mon Aug 21 2023 13:54:00 GMT+0000 (UTC)

published: Mon Aug 21 2023 13:54:00 GMT+0000 (UTC)

arXiv
