I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning

Ziyang Luo; Zhipeng Hu; Yadong Xi; Rongsheng Zhang; Jing Ma

Image captioning is a popular vision-and-language task that generates a natural-language description of an image. Recent advances focus on scaling up model size and the amount of training data, which significantly increases the cost of training. As an alternative to these costly models, we introduce I-Tuning, a lightweight image captioning framework that contains only a small number of trainable parameters. The novel I-Tuning cross-attention module connects the non-trainable pre-trained language decoder GPT2 and the vision encoder CLIP-ViT. Since most parameters are not updated during training, our framework is lightweight and fast. Experimental results on three image captioning benchmarks show that our framework achieves performance comparable to or better than large-scale baseline systems. At the same time, our models require up to 10 times fewer trainable parameters and much less training data.
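The idea of bridging a frozen language decoder and a vision encoder with a small trainable cross-attention module can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `CrossAttentionAdapter`, the single-head design, the dimensions, and the zero-initialised output projection are all assumptions made for illustration. Queries come from the decoder's hidden states, keys and values come from image patch features, and the result is added back to the hidden states as a residual.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttentionAdapter:
    """Hypothetical single-head cross-attention bridge (illustrative only)."""

    def __init__(self, d_text, d_image, d_attn, seed=0):
        rng = np.random.default_rng(seed)
        # These four matrices stand in for the only trainable weights.
        self.w_q = rng.normal(0.0, 0.02, (d_text, d_attn))
        self.w_k = rng.normal(0.0, 0.02, (d_image, d_attn))
        self.w_v = rng.normal(0.0, 0.02, (d_image, d_attn))
        # Assumption: zero-init the output projection so the adapter
        # initially leaves the frozen decoder's hidden states unchanged.
        self.w_o = np.zeros((d_attn, d_text))

    def num_params(self):
        return sum(w.size for w in (self.w_q, self.w_k, self.w_v, self.w_o))

    def __call__(self, text_hidden, image_feats):
        q = text_hidden @ self.w_q            # (T, d_attn)
        k = image_feats @ self.w_k            # (P, d_attn)
        v = image_feats @ self.w_v            # (P, d_attn)
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, P)
        # Residual connection back into the decoder's hidden stream.
        return text_hidden + (weights @ v) @ self.w_o, weights

# Stand-ins for frozen model outputs (random data, illustrative sizes).
rng = np.random.default_rng(1)
text_hidden = rng.normal(size=(5, 768))    # 5 caption tokens
image_feats = rng.normal(size=(50, 768))   # 50 image patch features

adapter = CrossAttentionAdapter(d_text=768, d_image=768, d_attn=64)
out, weights = adapter(text_hidden, image_feats)
print(out.shape, weights.shape, adapter.num_params())
```

In this sketch only the four adapter matrices would be updated during training, which is the sense in which such a module stays small relative to the frozen backbones it connects.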

updated: Sat Oct 15 2022 01:05:02 GMT+0000 (UTC)

published: Mon Feb 14 2022 09:36:50 GMT+0000 (UTC)

arXiv
