CaSP: Class-agnostic Semi-Supervised Pretraining for Detection and Segmentation

Lu Qi; Jason Kuen; Zhe Lin; Jiuxiang Gu; Fengyun Rao; Dian Li; Weidong Guo; Zhen Wen; Jiaya Jia

To improve instance-level detection and segmentation performance, existing self-supervised and semi-supervised methods extract either very task-unrelated or very task-specific training signals from unlabeled data. We argue that these two approaches, at the two extreme ends of the task-specificity spectrum, are suboptimal for task performance. Using too few task-specific training signals causes underfitting to the ground-truth labels of downstream tasks, while using too many causes overfitting to those labels. We therefore propose a novel Class-agnostic Semi-supervised Pretraining (CaSP) framework that strikes a more favorable task-specificity balance when extracting training signals from unlabeled data. Compared to semi-supervised learning, CaSP reduces the task specificity of the training signals by ignoring class information in the pseudo labels and by adding a separate pretraining stage that uses only task-unrelated unlabeled data. At the same time, CaSP preserves the right amount of task specificity by leveraging box/mask-level pseudo labels. As a result, our pretrained model can better avoid underfitting/overfitting to ground-truth labels when finetuned on the downstream task. Using 3.6M unlabeled data, we achieve a remarkable performance gain of 4.7% over the ImageNet-pretrained baseline on object detection. Our pretrained model also demonstrates excellent transferability to other detection and segmentation tasks/frameworks.
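The idea of ignoring class information while keeping box-level pseudo labels can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` and `ClassAgnosticLabel` types, the `to_class_agnostic` helper, and the 0.5 score threshold are all hypothetical, and the abstract does not say how pseudo labels are generated or filtered.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Detection:
    """One prediction from some detector run on an unlabeled image (hypothetical type)."""
    box: Box
    label: str
    score: float


@dataclass(frozen=True)
class ClassAgnosticLabel:
    """A pseudo label with no class field: every kept box is simply 'an object'."""
    box: Box


def to_class_agnostic(
    detections: List[Detection], min_score: float = 0.5
) -> List[ClassAgnosticLabel]:
    # Keep confident boxes only (the threshold is an assumption, not from the paper)
    # and drop the class name so that only localization information remains.
    return [ClassAgnosticLabel(d.box) for d in detections if d.score >= min_score]


# Made-up predictions for a single unlabeled image.
predictions = [
    Detection((10.0, 20.0, 110.0, 220.0), "person", 0.92),
    Detection((30.0, 40.0, 90.0, 100.0), "dog", 0.31),
    Detection((200.0, 50.0, 320.0, 180.0), "car", 0.77),
]
pseudo_labels = to_class_agnostic(predictions)
```

In this sketch, the low-confidence "dog" box is discarded and the remaining two boxes carry no class name, so a model pretrained on them would see localization signals but no category signals.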

updated: Thu Dec 09 2021 14:54:59 GMT+0000 (UTC)

published: Thu Dec 09 2021 14:54:59 GMT+0000 (UTC)

arXiv
