Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks

Jiachun Pan; Pan Zhou; Shuicheng Yan

For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. For a downstream task, supervised fine-tuning of the pretrained encoder then remarkably surpasses conventional "supervised learning" (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To address these questions, we first show theoretically that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then, because the pretraining dataset is large and highly diverse and thus covers most features in the downstream dataset, the pretrained encoder can capture as many features as it can in the downstream dataset during the fine-tuning phase, and it provably does not lose these features. In contrast, SL captures only a random subset of the features, due to the lottery ticket hypothesis. MRP therefore provably achieves better performance than SL on classification tasks. Experimental results support our data assumptions as well as our theoretical implications.
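As a rough illustration of the masking and reconstruction step described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a pixel-style target with a mean-squared-error loss on masked patches only, a mask ratio of 0.75, and toy Gaussian data. The names `random_patch_mask` and `masked_mse` are hypothetical, and a trivial "mean of the visible patches" predictor stands in for the convolutional auto-encoder.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch that is masked out."""
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(patches, reconstruction, mask):
    """Mean squared error computed over the masked patches only."""
    errors = [
        (r - p) ** 2
        for patch, recon, m in zip(patches, reconstruction, mask) if m
        for p, r in zip(patch, recon)
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)

# Toy input: 16 patches with 8 values each (assumed sizes, for illustration).
patches = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)

# Stand-in for the auto-encoder (an assumption): predict every patch as the
# per-coordinate mean of the visible, unmasked patches.
visible_patches = [p for p, m in zip(patches, mask) if not m]
mean_patch = [sum(col) / len(col) for col in zip(*visible_patches)]
reconstruction = [mean_patch for _ in patches]

# Pretraining objective in this sketch: error on the masked patches only.
loss = masked_mse(patches, reconstruction, mask)
print(f"masked patches: {sum(mask)}, reconstruction loss: {loss:.4f}")
```

In an actual MRP setup the stand-in predictor would be replaced by a trainable encoder and decoder, and the loss would be minimized by gradient descent; for feature-level targets such as those of data2vec, the target would be a teacher network's features rather than raw pixels.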

updated: Sat Feb 11 2023 13:19:06 GMT+0000 (UTC)

published: Wed Jun 08 2022 11:49:26 GMT+0000 (UTC)

arXiv

