3M: Multi-style image caption generation using Multi-modality features under Multi-UPDOWN model

Chengxi Li; Brent Harrison

In this paper, we build a multi-style generative model for stylish image captioning that uses multi-modality image features, namely ResNeXt features and text features generated by DenseCap. We propose the 3M model, a Multi-UPDOWN caption model that encodes multi-modality features and decodes them into captions. We demonstrate the effectiveness of our model at generating human-like captions by examining its performance on two datasets, PERSONALITY-CAPTIONS and FlickrStyle10K. We compare against a variety of state-of-the-art baselines on automatic NLP metrics including BLEU, ROUGE-L, CIDEr, and SPICE. We also conduct a qualitative study to verify that the 3M model can generate captions in different styles.

updated: Sat Mar 20 2021 14:12:13 GMT+0000 (UTC)

published: Sat Mar 20 2021 14:12:13 GMT+0000 (UTC)

arXiv
