Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks

Jiachun Pan; Pan Zhou; Shuicheng Yan

For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then, for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To address these questions, we theoretically show that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on a downstream classification task. Specifically, we assume that the pretraining dataset contains multi-view samples of ratio 1-μ and single-view samples of ratio μ, where multi-view and single-view samples have multiple and single discriminative semantics, respectively. Then, for pretraining, we prove that 1) the convolution kernels of the MRP encoder capture all discriminative semantics in the pretraining data; and 2) each convolution kernel captures at most one semantic. Accordingly, in the downstream supervised fine-tuning, most semantics would be captured and different semantics would not be fused together. This helps the fine-tuned downstream network easily establish the relation between kernels and semantic class labels. In this way, the fine-tuned encoder in MRP provably achieves zero test error with high probability on both multi-view and single-view test data. In contrast, as proved in [3], conventional SL can only obtain a test accuracy of around 0.5μ on single-view test data. Together, these results explain the benefits of MRP in downstream tasks. Experimental results support the multi-view data assumption and our theoretical implications.
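The masking-and-reconstruction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy patch sizes, the choice to zero out masked patches, and the mean-squared-error loss restricted to masked patches are all assumptions made for illustration, and the "auto-encoder" here is a trivial stand-in that echoes its input.

```python
import random

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch that is hidden."""
    num_masked = round(mask_ratio * num_patches)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def apply_mask(patches, mask):
    """Zero out the masked patches (an assumed corruption scheme)."""
    return [[0.0] * len(p) if m else list(p) for p, m in zip(patches, mask)]

def masked_reconstruction_loss(patches, reconstruction, mask):
    """Mean squared error over masked patches only.

    Assumes at least one patch is masked.
    """
    errors = [(r - x) ** 2
              for p, q, m in zip(patches, reconstruction, mask) if m
              for x, r in zip(p, q)]
    return sum(errors) / len(errors)

rng = random.Random(0)
# Toy input: 16 patches with 8 values each (hypothetical sizes).
patches = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
mask = random_patch_mask(len(patches), 0.75, rng)
corrupted = apply_mask(patches, mask)

# Stand-in for an auto-encoder: it simply returns the corrupted input.
reconstruction = corrupted
loss = masked_reconstruction_loss(patches, reconstruction, mask)
```

In an actual MRP setup, `reconstruction` would come from a trained encoder and decoder, and pretraining would minimize this kind of loss so that the encoder must infer the hidden patches from the visible ones.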

updated: Thu Jun 09 2022 01:46:19 GMT+0000 (UTC)

published: Wed Jun 08 2022 11:49:26 GMT+0000 (UTC)

arXiv
