GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang; Zhengyuan Yang; Xiaowei Hu; Linjie Li; Kevin Lin; Zhe Gan; Zicheng Liu; Ce Liu; Lijuan Wang

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost model performance. Without bells and whistles, GIT establishes a new state of the art on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks. Code is released at https://github.com/microsoft/GenerativeImage2Text.
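As an illustration of the "single language modeling task" mentioned in the abstract, the following is a minimal sketch of a standard language modeling loss: the mean negative log-likelihood of each caption token given the image and the preceding tokens. It is not taken from the released code. The function name `language_modeling_loss` and the input `step_probs` (per-position probability distributions that a decoder would produce) are hypothetical, and the exact loss used by the authors may differ in details such as smoothing or normalization.

```python
import math
from typing import Sequence


def language_modeling_loss(
    step_probs: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> float:
    """Mean negative log-likelihood of the target caption tokens.

    Assumption (hypothetical interface): step_probs[i] is the decoder's
    probability distribution over the vocabulary at caption position i,
    already conditioned on the image features and on the target tokens
    before position i.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("step_probs and target_ids must have the same length")
    if not target_ids:
        raise ValueError("at least one target token is required")

    nll = 0.0
    for probs, token_id in zip(step_probs, target_ids):
        # Penalize the model by the negative log-probability it assigned
        # to the correct next token at this position.
        nll -= math.log(probs[token_id])
    return nll / len(target_ids)


if __name__ == "__main__":
    # Toy example: a 4-token vocabulary and a 3-token caption.
    toy_probs = [
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.6, 0.2, 0.1],
        [0.2, 0.2, 0.5, 0.1],
    ]
    print(language_modeling_loss(toy_probs, [0, 1, 2]))
```

Under this formulation, captioning and question answering can share the same objective: only the text sequence the decoder is asked to predict changes.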

updated: Mon Aug 22 2022 17:42:41 GMT+0000 (UTC)

published: Fri May 27 2022 17:03:38 GMT+0000 (UTC)

arXiv
