EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

Jue Wang; Haofan Wang; Jincan Deng; Weijia Wu; Debing Zhang

While large-scale pre-training has made great progress in bridging the gap between vision and language, it still faces several challenges. First, pre-training is expensive. Second, there is no efficient way to handle data noise, which degrades model performance. Third, previous methods leverage only limited image-text paired data and ignore richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose EfficientCLIP, a method that uses Ensemble Confident Learning to obtain a less noisy data subset. Abundant non-paired single-modal text data is additionally used to boost the generalization of the text branch. We achieve state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources compared to CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.

updated: Wed Sep 22 2021 11:13:48 GMT+0000 (UTC)

published: Fri Sep 10 2021 07:09:39 GMT+0000 (UTC)

arXiv
