Could Giant Pretrained Image Models Extract Universal Representations?

Yutong Lin; Ze Liu; Zheng Zhang; Han Hu; Nanning Zheng; Stephen Lin; Yue Cao

Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
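The frozen setting described in the abstract can be illustrated with a toy sketch: the base network's weights stay fixed, and only a small task-specific head is trained on top of its features. This is not the paper's method and has nothing to do with SwinV2-G itself. The tiny ReLU "backbone", the regression task, and every name below (`BACKBONE_WEIGHTS`, `frozen_backbone`, `train_head`, and so on) are hypothetical and chosen only to keep the example self-contained.

```python
import random

# Hypothetical fixed "backbone" weights; they stand in for a pretrained network.
BACKBONE_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]


def frozen_backbone(x, weights=BACKBONE_WEIGHTS):
    """Frozen feature extractor: ReLU(W x). Nothing in here is ever updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def make_data(n, seed=0):
    """Toy downstream task (an assumption): predict |x1| + |x2| from a 2-D input."""
    rng = random.Random(seed)
    points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    return [(x, abs(x[0]) + abs(x[1])) for x in points]


def predict(x, head, bias):
    """Linear head applied to the frozen features."""
    feats = frozen_backbone(x)
    return sum(h * f for h, f in zip(head, feats)) + bias


def mse(data, head, bias):
    """Mean squared error of the head on the given data."""
    return sum((predict(x, head, bias) - y) ** 2 for x, y in data) / len(data)


def train_head(data, lr=0.05, epochs=200):
    """Train only the head with SGD on squared error; the backbone is untouched."""
    head = [0.0] * len(BACKBONE_WEIGHTS)
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = frozen_backbone(x)
            err = sum(h * f for h, f in zip(head, feats)) + bias - y
            # Gradient step on the head parameters only.
            head = [h - lr * err * f for h, f in zip(head, feats)]
            bias -= lr * err
    return head, bias
```

Calling `train_head(make_data(50))` returns the head weights and bias. The only trainable parameters are the four head weights and the bias, which mirrors, at toy scale, the abstract's point that a frozen model leaves relatively few parameters available for adapting to a downstream task.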

updated: Thu Nov 03 2022 17:57:10 GMT+0000 (UTC)

published: Thu Nov 03 2022 17:57:10 GMT+0000 (UTC)

arXiv
