Supervised Multimodal Bitransformers for Classifying Images and Text

Douwe Kiela; Suvrat Bhooshan; Hamed Firooz; Ethan Perez; Davide Testuggine

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
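The abstract does not spell out how the text and image encoders are fused. As a rough illustration only, the sketch below shows one common approach: image features are linearly projected into the text embedding space, and the resulting "image tokens" are concatenated with the text token embeddings into a single sequence that a transformer could then process. The function names, dimensions, and random values are all hypothetical and are not taken from the paper.

```python
import random


def project(vec, weight):
    """Linearly map a feature vector into the text embedding space.

    `weight` is a list of rows, one row per output dimension.
    """
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse(text_embeddings, image_features, weight):
    """Project image features, then prepend them to the text token embeddings.

    Assumption: fusion by concatenation into one joint sequence. This is a
    sketch of a common design, not a confirmed description of the paper's model.
    """
    image_tokens = [project(f, weight) for f in image_features]
    return image_tokens + text_embeddings


random.seed(0)
TEXT_DIM, IMAGE_DIM = 4, 6  # toy sizes chosen for illustration

# Stand-ins for encoder outputs: 5 text tokens and 3 image feature vectors.
text_embeddings = [[random.random() for _ in range(TEXT_DIM)] for _ in range(5)]
image_features = [[random.random() for _ in range(IMAGE_DIM)] for _ in range(3)]

# Hypothetical learned projection matrix (TEXT_DIM rows, IMAGE_DIM columns).
weight = [[random.random() for _ in range(IMAGE_DIM)] for _ in range(TEXT_DIM)]

fused = fuse(text_embeddings, image_features, weight)
print(len(fused), len(fused[0]))  # 8 4
```

In this toy setup the fused sequence has 8 positions (3 image tokens followed by 5 text tokens), each of the text embedding size, so a single transformer can attend across both modalities.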

updated: Thu Nov 12 2020 03:08:28 GMT+0000 (UTC)

published: Fri Sep 06 2019 14:59:18 GMT+0000 (UTC)

arXiv
