VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations

Georgios Chochlakis; Tejas Srinivasan; Jesse Thomason; Shrikanth Narayanan

We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT) that improves performance on vision-and-language tasks involving more complex text inputs than image captions, with minimal impact on training and inference efficiency. Importantly, ViLT enables efficient training and inference in vision-and-language tasks by using a shallow image encoder. However, it is pretrained on captioning and similar datasets, where the language input is simple, literal, and descriptive, and therefore lacks linguistic diversity. As a result, when working with multimedia data in the wild, such as multimodal social media data (in our work, Twitter), there is a notable shift away from captioning language data, as well as a greater diversity of tasks, and we indeed find evidence that the language capacity of ViLT is lacking. The key insight of VAuLT is to propagate the output representations of a large language model like BERT to the language input of ViLT. We show that this strategy significantly improves over ViLT on vision-and-language tasks involving richer language inputs and affective constructs, such as TWITTER-2015, TWITTER-2017, MVSA-Single and MVSA-Multiple, but lags behind on pure reasoning tasks such as the Bloomberg Twitter Text-Image Relationship dataset. We have released the code for all our experiments at https://github.com/gchochla/VAuLT.
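
The data flow described in the abstract, in which a language model's output representations become the language input of the vision-and-language model, can be illustrated with a toy sketch. This is not the released implementation: the function names, the hidden width of 8, the vocabulary size, and the mean-mixing "contextualization" are all hypothetical stand-ins chosen so the example runs with the Python standard library alone. It also assumes both models share a hidden width, so no projection layer is shown.

```python
import random

HIDDEN = 8  # toy hidden width shared by both models (assumption)


def toy_language_model(token_ids, table):
    """Hypothetical stand-in for a BERT-like encoder.

    Returns one vector per token. Each vector is mixed with the
    sequence mean as a crude placeholder for contextualization.
    """
    vectors = [table[t] for t in token_ids]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [[0.5 * v + 0.5 * m for v, m in zip(vec, mean)] for vec in vectors]


def toy_patch_embeddings(num_patches, rng):
    """Hypothetical stand-in for a shallow image encoder (one vector per patch)."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(num_patches)]


def build_joint_input(text_vectors, patch_vectors):
    """Concatenate the language and image sequences for a joint transformer."""
    return text_vectors + patch_vectors


rng = random.Random(0)
vocab_table = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(100)]
token_ids = [5, 17, 42, 3]  # arbitrary toy token ids

# The language model's outputs, not raw word embeddings, form the text input.
text_vectors = toy_language_model(token_ids, vocab_table)
patch_vectors = toy_patch_embeddings(6, rng)
joint_sequence = build_joint_input(text_vectors, patch_vectors)

print(len(joint_sequence), len(joint_sequence[0]))  # prints "10 8"
```

In a real setting the two stand-ins would be replaced by pretrained models; the sketch only shows where the propagated representations enter the joint sequence.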

updated: Thu Aug 18 2022 18:51:13 GMT+0000 (UTC)

published: Thu Aug 18 2022 18:51:13 GMT+0000 (UTC)

arXiv
