MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Xiaoyi Dong; Jianmin Bao; Yinglin Zheng; Ting Zhang; Dongdong Chen; Hao Yang; Ming Zeng; Weiming Zhang; Lu Yuan; Dong Chen; Fang Wen; Nenghai Yu

This paper presents MaskCLIP, a simple yet effective framework that incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation has two vital benefits. First, masked self-distillation targets local patch representation learning, which complements vision-language contrastive learning and its focus on text-related representations. Second, masked self-distillation is consistent with vision-language contrastive learning in its training objective, since both use the visual encoder for feature alignment; it can therefore learn local semantics with indirect supervision from language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Symmetrically, we also introduce local semantic supervision into the text branch, which further improves pretraining performance. With extensive experiments, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder. Code will be released at https://github.com/LightDXY/MaskCLIP.
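
The abstract describes the objective only at a high level, so the following NumPy sketch is one possible reading of it and not the authors' implementation. It pairs a symmetric image-text contrastive loss with a distillation loss that pulls the patch features predicted from a masked image toward the features of the full image. Everything specific here is an assumption made for illustration: the toy linear encoder (`toy_encoder`), the cosine form of the distillation loss, the exponential-moving-average teacher (`ema_update`), the 75% mask ratio, and the loss weight `distill_weight`.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    def diag_cross_entropy(l):
        # Cross-entropy where the matching pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


def masked_distill_loss(student_patches, teacher_patches, mask):
    """Mean (1 - cosine similarity) over masked patch positions only.

    Assumed loss form; the abstract does not specify one.
    mask has shape (batch, patches), with 1 marking a masked patch.
    """
    s = l2_normalize(student_patches)
    t = l2_normalize(teacher_patches)
    per_patch = 1.0 - (s * t).sum(axis=-1)
    return (per_patch * mask).sum() / max(mask.sum(), 1.0)


def ema_update(teacher_w, student_w, momentum=0.999):
    """Assumed teacher update: exponential moving average of student weights."""
    return momentum * teacher_w + (1.0 - momentum) * student_w


def toy_encoder(patches, w):
    """Hypothetical stand-in for a visual encoder: a per-patch linear map
    plus the mean over patches, so each output carries some global context."""
    h = patches @ w
    return h + h.mean(axis=1, keepdims=True)


rng = np.random.default_rng(0)
batch, num_patches, dim_in, dim = 4, 16, 32, 8
patches = rng.normal(size=(batch, num_patches, dim_in))
text_emb = rng.normal(size=(batch, dim))  # stand-in for text encoder output
student_w = rng.normal(size=(dim_in, dim)) * 0.1
teacher_w = student_w.copy()

# 1 marks a masked patch; the 75% ratio is an assumption.
mask = (rng.random((batch, num_patches)) < 0.75).astype(np.float64)

# Teacher encodes the full image; student encodes the masked image.
teacher_patches = toy_encoder(patches, teacher_w)
student_patches = toy_encoder(patches * (1.0 - mask[..., None]), student_w)

# Global image embedding for the image-text contrastive term.
image_emb = toy_encoder(patches, student_w).mean(axis=1)

distill_weight = 1.0  # hypothetical weighting between the two terms
loss_contrastive = contrastive_loss(image_emb, text_emb)
loss_distill = masked_distill_loss(student_patches, teacher_patches, mask)
total_loss = loss_contrastive + distill_weight * loss_distill

# After an optimizer step on the student, the teacher would follow it.
teacher_w = ema_update(teacher_w, student_w)
print(float(loss_contrastive), float(loss_distill), float(total_loss))
```

In this sketch both loss terms depend on the same student encoder weights, which is the sense in which the abstract calls the two objectives consistent: the distillation targets live in the feature space that the contrastive term aligns with text.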

updated: Sun Apr 09 2023 15:59:26 GMT+0000 (UTC)

published: Thu Aug 25 2022 17:59:58 GMT+0000 (UTC)

arXiv

