Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students

Xu Zheng; Yunhao Luo; Hao Wang; Chong Fu; Lin Wang

Popular methods for semi-supervised semantic segmentation mostly adopt a single network model based on convolutional neural networks (CNNs) and enforce consistency of the model's predictions under small perturbations applied to the inputs or the model. However, this learning paradigm suffers from a) the limited learning capability of the CNN-based model; b) a limited capacity to learn discriminative features from the unlabeled data; and c) limited learning of both global and local information from the whole image. In this paper, we propose a novel semi-supervised learning approach, called Transformer-CNN Cohort (TCC), that consists of two students, one based on the vision transformer (ViT) and the other on a CNN. Our method incorporates multi-level consistency regularization on the predictions and the heterogeneous feature spaces via pseudo labeling of the unlabeled data. First, as the inputs of the ViT student are image patches, the extracted feature maps encode crucial class-wise statistics. We therefore propose class-aware feature consistency distillation (CFCD), which first uses the outputs of each student as pseudo labels to generate class-aware feature (CF) maps and then transfers knowledge between the students via the CF maps. Second, as the ViT student has more uniform representations across all layers, we propose consistency-aware cross distillation to transfer knowledge between the pixel-wise predictions of the cohort. We validate the TCC framework on the Cityscapes and Pascal VOC 2012 datasets, where it outperforms existing semi-supervised methods by a large margin.
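To make the two consistency terms more concrete, the following is a minimal sketch of one plausible reading of the abstract, not the authors' implementation. It assumes that a CF map can be approximated by per-class mean feature vectors pooled with the pseudo labels, that both students' features have already been projected to the same dimension, and that cross distillation can be stood in for by cross-entropy against the other student's hard pseudo labels. All function names are hypothetical, and plain Python lists replace tensors to keep the example dependency-free.

```python
import math

def pseudo_labels(logits):
    """Per-pixel argmax; `logits` is a list of per-pixel class-score lists."""
    return [max(range(len(p)), key=p.__getitem__) for p in logits]

def class_prototypes(features, labels):
    """Average the feature vectors of all pixels assigned to each class.

    Assumption: this per-class mean stands in for a class-aware feature map.
    Classes with no assigned pixels are simply absent from the result.
    """
    dim = len(features[0])
    sums, counts = {}, {}
    for feat, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * dim)
        for i, v in enumerate(feat):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def prototype_consistency(protos_a, protos_b):
    """Mean squared distance over classes that both students predicted."""
    shared = sorted(set(protos_a) & set(protos_b))
    if not shared:
        return 0.0
    total = 0.0
    for c in shared:
        a, b = protos_a[c], protos_b[c]
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return total / len(shared)

def cross_entropy(logits, targets):
    """Mean softmax cross-entropy of per-pixel logits against hard targets."""
    loss = 0.0
    for p, t in zip(logits, targets):
        m = max(p)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in p))
        loss += log_z - p[t]
    return loss / len(logits)

# Toy unlabeled "image": 4 pixels, 2 classes, 2-dimensional features.
vit_logits = [[2.0, 0.0], [1.5, 0.2], [0.1, 1.0], [0.0, 3.0]]
cnn_logits = [[1.0, 0.5], [0.3, 0.9], [0.2, 2.0], [0.1, 1.2]]
vit_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
cnn_feats = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.0, 1.0]]

vit_pl = pseudo_labels(vit_logits)
cnn_pl = pseudo_labels(cnn_logits)

# Feature-level term: compare class statistics pooled by each student's own labels.
feature_loss = prototype_consistency(
    class_prototypes(vit_feats, vit_pl),
    class_prototypes(cnn_feats, cnn_pl),
)

# Prediction-level term: each student is supervised by the other's pseudo labels.
cross_loss = cross_entropy(cnn_logits, vit_pl) + cross_entropy(vit_logits, cnn_pl)

print(feature_loss, cross_loss)
```

In a real training loop these two terms would be weighted and added to the supervised loss on labeled images; the weighting and any confidence filtering of pseudo labels are not specified in the abstract and are left out here.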

updated: Tue Sep 06 2022 02:11:08 GMT+0000 (UTC)

published: Tue Sep 06 2022 02:11:08 GMT+0000 (UTC)

arXiv
