Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer

Yingyi Chen; Xi Shen; Yahui Liu; Qinghua Tao; Johan A. K. Suykens

The success of Vision Transformer (ViT) in various computer vision tasks has made this convolution-free network increasingly prevalent. Because ViT operates on image patches, it is potentially relevant to jigsaw puzzle solving, a classical self-supervised task that aims to reorder shuffled sequential image patches back to their natural form. Despite its simplicity, solving jigsaw puzzles has been shown to help diverse tasks using Convolutional Neural Networks (CNNs), such as self-supervised feature representation learning, domain generalization, and fine-grained classification. In this paper, we explore solving jigsaw puzzles as a self-supervised auxiliary loss in ViT for image classification, an approach we name Jigsaw-ViT. We show two modifications that can make Jigsaw-ViT superior to standard ViT: discarding positional embeddings and masking patches randomly. Although simple, Jigsaw-ViT improves both generalization and robustness over the standard ViT, two properties that usually trade off against each other. Experimentally, we show that adding the jigsaw puzzle branch provides better generalization than ViT on large-scale image classification on ImageNet. Moreover, the auxiliary task also improves robustness to noisy labels on Animal-10N, Food-101N, and Clothing1M, as well as to adversarial examples. Our implementation is available at https://yingyichen-cyy.github.io/Jigsaw-ViT/.
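The auxiliary objective described in the abstract can be sketched roughly as follows. This is a minimal, standard-library illustration and not the authors' implementation: the function names, the mask ratio of 0.5, the 196-patch grid, and the weight `eta` on the jigsaw term are all assumptions made for the example. The idea shown is that the jigsaw branch receives a random subset of patch tokens in shuffled order without positional embeddings, predicts the original position of each one, and its cross-entropy is added to the classification loss.

```python
import math
import random


def make_jigsaw_inputs(num_patches, mask_ratio, rng):
    """Choose which patches the jigsaw branch sees, in shuffled order.

    The returned list of original patch indices doubles as the gather
    order for patch tokens and as the per-token position targets.
    (Hypothetical helper; mask_ratio is an assumed hyperparameter.)
    """
    keep = max(1, round(num_patches * (1.0 - mask_ratio)))
    order = list(range(num_patches))
    rng.shuffle(order)      # random order; no positional embeddings are added
    return order[:keep]     # dropping the tail masks patches at random


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def jigsaw_loss(position_logits, targets):
    """Mean position-prediction cross-entropy over the kept patches."""
    losses = [cross_entropy(l, t) for l, t in zip(position_logits, targets)]
    return sum(losses) / len(losses)


def total_loss(cls_loss, jig_loss, eta=1.0):
    """Classification loss plus the weighted auxiliary term (eta is assumed)."""
    return cls_loss + eta * jig_loss


# Usage: 196 patches (a 14 x 14 grid), half of them masked.
rng = random.Random(0)
targets = make_jigsaw_inputs(num_patches=196, mask_ratio=0.5, rng=rng)
# Stand-in for the jigsaw head's output: uniform logits over 196 positions.
logits = [[0.0] * 196 for _ in targets]
loss = jigsaw_loss(logits, targets)
```

With uniform logits the jigsaw loss equals log(196), the value expected from a head that has learned nothing about patch positions; a trained head would push it lower.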

updated: Mon Jul 25 2022 08:18:18 GMT+0000 (UTC)

published: Mon Jul 25 2022 08:18:18 GMT+0000 (UTC)

arXiv
