Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Yatai Ji; Rongcheng Tu; Jie Jiang; Weijie Kong; Chengfei Cai; Wenzhe Zhao; Hongfa Wang; Yujiu Yang; Wei Liu

見逃したものを見る: セマンティック補完学習による視覚言語の事前トレーニング

視覚言語事前トレーニング (VLP) モデルがさまざまなモダリティ間で対応する正しい情報を学習するには、クロスモーダルアラインメントが不可欠です。この目的のために、NLP 事前トレーニング領域でのマスク言語モデリング (MLM) タスクの成功に触発されて、クロスモーダルインタラクションをさらに促進するために、VLP 用に多数のマスクモデリングタスクが提案されています。以前のマスクされたモデリングタスクの中心となるアイデアは、ローカル間のアラインメントを学習するために、目に見えるコンテキストに基づいてマスクされたトークンを再構築することに焦点を当てることです。ただし、それらのほとんどは、マスクされたデータに対して生成されたグローバルなセマンティック機能にほとんど注意を払っていないため、グローバル表現のクロスモーダルアラインメント機能が制限されています。したがって、この論文では、グローバルからローカルへのアラインメントを容易にするために、既存のマスクされたモデリングタスクを補完する、新しいセマンティック補完学習 (SCL) タスクを提案します。具体的には、SCL タスクは、他のモダリティから対応する情報を取得することで、マスクされたデータの欠落しているセマンティクスを補完し、ダウンストリームタスクのパフォーマンスに大きな影響を与える、より代表的なグローバル機能の学習を促進します。さらに、モデルが画像テキストとビデオテキストのマルチモーダルタスクを同時に実行できる柔軟なビジョンエンコーダーを提示します。実験結果は、提案された方法が、視覚的質問応答、画像テキスト検索、ビデオテキスト検索などのさまざまな視覚言語ベンチマークで最先端のパフォーマンスを発揮することを示しています。

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task complements the missing semantics of masked data by capturing the corresponding information from the other modality, promoting learning more representative global features which have a great impact on the performance of downstream tasks. Moreover, we present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

updated: Sun Mar 26 2023 13:59:36 GMT+0000 (UTC)

published: Thu Nov 24 2022 06:39:16 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト