VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media

Georgios Chochlakis; Tejas Srinivasan; Jesse Thomason; Shrikanth Narayanan

We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT extends the popular Vision-and-Language Transformer (ViLT) and improves performance on vision-and-language (VL) tasks whose text inputs are more complex than image captions, with minimal impact on training and inference efficiency. ViLT enables efficient training and inference in VL tasks by encoding images with a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, and therefore lacks linguistic diversity. Multimedia data in the wild, such as multimodal social media data, consequently shows a notable shift away from captioning language, as well as a greater diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight and novelty of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT can yield relative improvements of up to 20% over ViLT and achieve state-of-the-art or comparable performance on VL tasks involving richer language inputs and affective constructs, such as Target-Oriented Sentiment Classification on TWITTER-2015 and TWITTER-2017, and Sentiment Classification on MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.
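The data flow described in the abstract (LM output representations fed to the language input of a ViLT-style encoder, with both trained jointly) can be illustrated with a toy PyTorch sketch. This is not the authors' implementation, which is in the linked repository. Every name and size here (`ToyVAuLT`, `tiny_encoder`, the 32-dimensional width, the single-layer encoders) is a hypothetical stand-in: small randomly initialised Transformer encoders replace the pretrained BERT and ViLT, and position embeddings are omitted for brevity.

```python
import torch
from torch import nn


def tiny_encoder(dim: int, layers: int) -> nn.TransformerEncoder:
    """Small Transformer encoder used as a stand-in for a pretrained model."""
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=4, dim_feedforward=2 * dim, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=layers)


class ToyVAuLT(nn.Module):
    """Toy illustration only: an LM's outputs become the text input of a joint encoder."""

    def __init__(self, vocab_size=100, dim=32, patch=8, num_classes=3):
        super().__init__()
        # Stand-in for the language model (BERT in the abstract).
        self.lm_embed = nn.Embedding(vocab_size, dim)
        self.lm = tiny_encoder(dim, layers=1)
        # ViLT-style image encoding: a linear projection of non-overlapping
        # patches, written as a strided convolution (no object detector).
        self.patch_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Modality-type embeddings: index 0 = text, index 1 = image.
        self.type_embed = nn.Embedding(2, dim)
        # Stand-in for the ViLT Transformer over the joint sequence.
        self.vl = tiny_encoder(dim, layers=1)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids, images):
        # LM output representations, shape (batch, tokens, dim).
        text = self.lm(self.lm_embed(token_ids))
        # Patch embeddings, shape (batch, patches, dim).
        patches = self.patch_proj(images).flatten(2).transpose(1, 2)
        text = text + self.type_embed.weight[0]
        patches = patches + self.type_embed.weight[1]
        # The LM outputs take the place of plain word embeddings.
        joint = self.vl(torch.cat([text, patches], dim=1))
        # Classify from the first text position (assumed [CLS]-like token).
        return self.head(joint[:, 0])


torch.manual_seed(0)
model = ToyVAuLT()
token_ids = torch.randint(0, 100, (2, 5))   # 2 texts, 5 tokens each
images = torch.randn(2, 3, 32, 32)          # 2 RGB images, 32x32 pixels
logits = model(token_ids, images)
# Joint training: one loss sends gradients through both encoders.
logits.sum().backward()
```

Because the loss is backpropagated through the joint encoder into the stand-in LM, both sets of parameters receive gradients, which is the sense of "joint training" assumed in this sketch.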

updated: Wed Jan 25 2023 22:48:29 GMT+0000 (UTC)

published: Thu Aug 18 2022 18:51:13 GMT+0000 (UTC)

arXiv
