VLP: A Survey on Vision-Language Pre-training

Feilong Chen; Duzhen Zhang; Minglun Han; Xiuyi Chen; Jing Shi; Shuang Xu; Bo Xu

In the past few years, the emergence of pre-trained models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. A substantial body of work has shown that these models benefit downstream uni-modal tasks and remove the need to train a new model from scratch. Can such pre-trained models also be applied to multi-modal tasks? Researchers have explored this question and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), covering both image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. We then summarize specific VLP models in detail. Finally, we discuss new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that it can shed light on future research in the VLP field.

updated: Sat Jul 30 2022 14:38:11 GMT+0000 (UTC)

published: Fri Feb 18 2022 07:54:02 GMT+0000 (UTC)

arXiv
