VLP: A Survey on Vision-Language Pre-training

Feilong Chen; Duzhen Zhang; Minglun Han; Xiuyi Chen; Jing Shi; Shuang Xu; Bo Xu

In the past few years, the emergence of pre-trained models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. A substantial body of work has shown that these models benefit downstream uni-modal tasks and avoid the need to train a new model from scratch. Can such pre-trained models also be applied to multi-modal tasks? Researchers have explored this question and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey on VLP. We hope that this survey can shed light on future research in the VLP field.

updated: Mon Jun 20 2022 08:06:36 GMT+0000 (UTC)

published: Fri Feb 18 2022 07:54:02 GMT+0000 (UTC)

arXiv
