Large-Scale Bidirectional Training for Zero-Shot Image Captioning

Taehoon Kim; Mark Marsden; Pyunghwan Ahn; Sangyun Kim; Sihaeng Lee; Alessandra Sala; Seung Hwan Kim

When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark, which comprises high-quality datasets and an extensive set of metrics, to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of a large-scale training set and model architecture is the key to achieving zero-shot image captioning.

updated: Sun Nov 13 2022 00:09:36 GMT+0000 (UTC)

published: Sun Nov 13 2022 00:09:36 GMT+0000 (UTC)

arXiv
