Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

Chongjian Ge; Youwei Liang; Yibing Song; Jianbo Jiao; Jue Wang; Ping Luo

Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders trained via SSL achieve recognition performance comparable to those trained via supervised learning, their network attention remains under-explored as a route to further improvement. Motivated by transformers, which explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE framework consists of a CNN stream (C-stream) and a transformer stream (T-stream), where each stream contains two branches. The C-stream follows an existing SSL framework with two CNN encoders, two projectors, and a predictor. The T-stream contains two transformers, two projectors, and a predictor. The T-stream connects to the CNN encoders and runs in parallel to the remainder of the C-stream. During training, we perform SSL in both streams simultaneously and use the T-stream output to supervise the C-stream. The features from the CNN encoders are modulated in the T-stream for visual attention enhancement and become suitable for the SSL scenario. We use these modulated features to supervise the C-stream for learning attentive CNN encoders. In this way, we revitalize CNN attention by using transformers as guidance. Experiments on several standard visual recognition benchmarks, including image classification, object detection, and semantic segmentation, show that the proposed CARE framework improves CNN encoder backbones to state-of-the-art performance.
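The two-stream layout described in the abstract can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the authors' implementation: every module is replaced by a random linear map, the function names (`linear`, `cosine_loss`, `build_care`, `care_losses`) and the `guide_weight` parameter are hypothetical, and the loss forms (a BYOL-style cosine loss within each stream and a cosine term pulling the C-stream output toward the T-stream output) are assumed, since the abstract does not specify them.

```python
import math
import random

def linear(in_dim, out_dim, rng):
    """Random linear map: a toy stand-in for any learned module."""
    w = [[rng.gauss(0.0, 1.0) / math.sqrt(in_dim) for _ in range(in_dim)]
         for _ in range(out_dim)]
    return lambda x: [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def cosine_loss(p, z, eps=1e-12):
    """2 - 2 * cos(p, z); an assumed loss form, not taken from the abstract."""
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return 2.0 - 2.0 * dot / (norm + eps)

def build_care(dim=8, seed=0):
    """Create the modules named in the abstract, each as a toy linear map."""
    rng = random.Random(seed)
    names = [
        "cnn_1", "cnn_2",                   # two CNN encoders, feeding both streams
        "c_proj_1", "c_proj_2", "c_pred",   # C-stream: two projectors, one predictor
        "trans_1", "trans_2",               # T-stream: two transformers on CNN features
        "t_proj_1", "t_proj_2", "t_pred",   # T-stream: two projectors, one predictor
    ]
    return {name: linear(dim, dim, rng) for name in names}

def care_losses(view_1, view_2, m, guide_weight=1.0):
    """Compute the SSL loss of each stream plus a T-to-C guidance term."""
    feat_1, feat_2 = m["cnn_1"](view_1), m["cnn_2"](view_2)
    # C-stream: the predictor on branch 1 regresses the branch-2 projection.
    c_out = m["c_pred"](m["c_proj_1"](feat_1))
    c_tgt = m["c_proj_2"](feat_2)
    # T-stream: same layout, but CNN features first pass through the transformers.
    t_out = m["t_pred"](m["t_proj_1"](m["trans_1"](feat_1)))
    t_tgt = m["t_proj_2"](m["trans_2"](feat_2))
    loss_c = cosine_loss(c_out, c_tgt)
    loss_t = cosine_loss(t_out, t_tgt)
    # Guidance: pull the C-stream output toward the T-stream output (assumed form).
    loss_guide = cosine_loss(c_out, t_out)
    return {"c": loss_c, "t": loss_t, "guide": loss_guide,
            "total": loss_c + loss_t + guide_weight * loss_guide}

rng = random.Random(1)
view_1 = [rng.gauss(0.0, 1.0) for _ in range(8)]
view_2 = [v + 0.1 * rng.gauss(0.0, 1.0) for v in view_1]  # toy "augmentation"
losses = care_losses(view_1, view_2, build_care())
print(losses)
```

In a real setup the linear maps would be replaced by CNN backbones, transformer blocks, and MLP heads, and the total loss would be minimized by gradient descent; the sketch only shows how the three terms are wired together.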

updated: Mon Oct 11 2021 15:08:15 GMT+0000 (UTC)

published: Mon Oct 11 2021 15:08:15 GMT+0000 (UTC)

arXiv
