Patch-level Representation Learning for Self-supervised Vision Transformers

Sukmin Yun; Hankook Lee; Jaehyung Kim; Jinwoo Shin

Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by exploiting the architectural advantages of the underlying neural network, since the current state-of-the-art visual pretext tasks for SSL are architecture-agnostic and therefore do not benefit from those advantages. In particular, we focus on Vision Transformers (ViTs), which have recently gained much attention as a better architectural choice, often outperforming convolutional networks on various visual tasks. A distinctive characteristic of ViTs is that they take a sequence of disjoint patches from an image and process patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. Specifically, we enforce invariance between each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, ViTs trained with SelfPatch learn more semantically meaningful relations among patches (without using human-annotated labels), which can be particularly beneficial for dense prediction downstream tasks. Despite its simplicity, we demonstrate that SelfPatch can significantly improve the performance of existing SSL methods on various visual tasks, including object detection and semantic segmentation. In particular, SelfPatch improves the recent self-supervised ViT, DINO, by +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
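The idea that "each patch treats similar neighboring patches as positive samples" can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the neighborhood, the similarity measure, or the number of positives, so the 3x3 neighborhood, cosine similarity, the parameter `k`, and the function name `select_positive_neighbors` are all assumptions made for illustration. The sketch covers only positive selection, not the training loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def select_positive_neighbors(grid, k=1):
    """For each patch in an H x W grid of feature vectors, return the
    positions of its k most similar adjacent patches.

    Assumptions (not stated in the abstract): neighbors are the up-to-8
    patches in the surrounding 3x3 window, and similarity is cosine.
    """
    height, width = len(grid), len(grid[0])
    positives = {}
    for r in range(height):
        for c in range(width):
            scored = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < height and 0 <= cc < width
                    if (dr, dc) != (0, 0) and inside:
                        sim = cosine(grid[r][c], grid[rr][cc])
                        scored.append((sim, (rr, cc)))
            # Highest similarity first; keep the top-k positions as positives.
            scored.sort(key=lambda item: item[0], reverse=True)
            positives[(r, c)] = [pos for _, pos in scored[:k]]
    return positives

# Toy 2x3 grid of patch features: two "sky" columns and one "grass" column.
sky, grass = [1.0, 0.0], [0.0, 1.0]
toy_grid = [[sky, sky, grass],
            [sky, sky, grass]]
positives = select_positive_neighbors(toy_grid, k=1)
```

In this toy grid, each "grass" patch selects the other "grass" patch as its positive and each "sky" patch selects another "sky" patch, which is the kind of semantically consistent pairing the pretext task is meant to encourage. In a real setting the feature vectors would come from the ViT's patch-level outputs rather than hand-written lists.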

updated: Fri Jun 17 2022 01:35:03 GMT+0000 (UTC)

published: Thu Jun 16 2022 08:01:19 GMT+0000 (UTC)

arXiv
