Vision Language Transformers: A Survey

Clayton Fields; Casey Kennington

Vision language tasks, such as answering questions about an image or generating captions that describe it, are difficult for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced by Vaswani et al. (2017) to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining on large generic datasets and transferring what they learn to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of similar advancements in tasks that require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, their limitations, and the open questions that remain.

updated: Thu Jul 06 2023 19:08:56 GMT+0000 (UTC)

published: Thu Jul 06 2023 19:08:56 GMT+0000 (UTC)

arXiv
