Dense Contrastive Visual-Linguistic Pretraining

Lei Shi; Kai Shuang; Shijie Geng; Peng Gao; Zuohui Fu; Gerard de Melo; Yunpeng Chen; Sen Su

Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent images and text. These approaches achieve superior performance by capturing high-level semantic information through large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, because their visual features are pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling, they tend to suffer from noisy labels and sparse semantic annotations. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask Perturbation and Intra-/Inter-Adversarial Perturbation) are developed to improve the quality of negative samples used in contrastive learning. Overall, DCVLP allows cross-modality dense region contrastive learning in a self-supervised setting independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning for multimodal representation learning.
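
To make the idea of region-level contrastive learning concrete, the following is a minimal sketch of a generic InfoNCE-style loss over region features. It is an illustration only, not the authors' implementation: the abstract does not specify the loss, so the function name `region_info_nce`, the cosine similarity, and the temperature value are all assumptions. Each region feature from one view is treated as a query, the feature at the same index in a second view is its positive key, and all other regions act as negatives, so no object labels are needed.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def region_info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over regions (hypothetical helper, not from the paper).

    queries[i] and keys[i] are features of the same region under two views;
    every keys[j] with j != i serves as a negative for queries[i].
    """
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, sharpened by the temperature.
        logits = [dot(qi, kj) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching region as the correct class.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy usage: 8 regions with 16-dimensional features and a lightly perturbed second view.
rng = random.Random(0)
regions = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
perturbed = [[x + rng.gauss(0.0, 0.05) for x in r] for r in regions]
aligned_loss = region_info_nce(regions, perturbed)
# Rotating the keys by one position breaks the region correspondence.
mismatched_loss = region_info_nce(regions, perturbed[1:] + perturbed[:1])
```

In this toy setup the loss is low when each query is paired with its own perturbed region and high when the pairing is broken, which is the signal a contrastive objective of this kind trains on. How DCVLP actually builds its views and negatives (Mask Perturbation and Intra-/Inter-Adversarial Perturbation) is not detailed in the abstract and is not modeled here.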

updated: Fri Sep 24 2021 07:20:13 GMT+0000 (UTC)

published: Fri Sep 24 2021 07:20:13 GMT+0000 (UTC)

arXiv
