VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Jun Chen; Han Guo; Kai Yi; Boyang Li; Mohamed Elhoseiny

The ability to learn quickly from a small quantity of training data widens the range of machine learning applications. In this paper, we propose a data-efficient image captioning model, VisualGPT, which leverages the linguistic knowledge from a large pretrained language model (LM). A crucial challenge is to balance the use of visual information in the image against the prior linguistic knowledge acquired from pretraining. We designed a novel self-resurrecting encoder-decoder attention mechanism to quickly adapt the pretrained LM as the language decoder on a small amount of in-domain training data. The proposed self-resurrecting activation unit produces sparse activations but has reduced susceptibility to zero gradients. We train the proposed model, VisualGPT, on 0.1%, 0.5% and 1% of the MS COCO and Conceptual Captions training data. Under these conditions, we outperform the best baseline model by up to 10.8% CIDEr on MS COCO and up to 5.4% CIDEr on Conceptual Captions. Further, VisualGPT achieves the state-of-the-art result on IU X-ray, a medical report generation dataset. To the best of our knowledge, this is the first work that improves the data efficiency of image captioning by utilizing an LM pretrained on unimodal data. Our code is available at: https://github.com/Vision-CAIR/VisualGPT.
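
The abstract does not spell out the self-resurrecting activation unit, so the following is only a minimal sketch of one gate with the two stated properties (sparse outputs, reduced exposure to zero gradients). It assumes a pair of complementary sigmoid gates, one for the visual signal and one for the language signal, each zeroed when it falls below a threshold. The names `self_resurrecting_gates`, `fuse`, and `tau`, and the default value 0.2, are illustrative assumptions and are not taken from the released code.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def self_resurrecting_gates(h: float, tau: float = 0.2) -> Tuple[float, float]:
    """Return (visual_gate, language_gate) for a pre-activation h.

    Assumed form: the two gates are sigmoid(h) and 1 - sigmoid(h),
    and each one is set to zero when it does not exceed tau.
    """
    s = sigmoid(h)
    b_vis = s if s > tau else 0.0                   # sparse visual gate
    b_lan = (1.0 - s) if (1.0 - s) > tau else 0.0   # sparse language gate
    return b_vis, b_lan


def fuse(visual: List[float], language: List[float],
         h: List[float], tau: float = 0.2) -> List[float]:
    """Element-wise mix of visual and language features using the gates."""
    fused = []
    for v, l, x in zip(visual, language, h):
        b_vis, b_lan = self_resurrecting_gates(x, tau)
        fused.append(b_vis * v + b_lan * l)
    return fused


if __name__ == "__main__":
    for h in (-5.0, 0.0, 5.0):
        print(h, self_resurrecting_gates(h))
```

Under this assumed form with `tau` below 0.5, the two gates cannot both be zero at once: when one is clipped, the other exceeds `1 - tau` and still passes gradient to `h`, so a clipped gate can become active again later in training. That is one reading of "self-resurrecting"; the paper and repository remain the authoritative definition.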

updated: Sat Apr 17 2021 07:14:40 GMT+0000 (UTC)

published: Sat Feb 20 2021 18:02:42 GMT+0000 (UTC)

arXiv
