OSIC: A New One-Stage Image Captioner Coined

Bo Wang; Zhao Zhang; Mingbo Zhao; Xiaojie Jin; Mingliang Xu; Meng Wang

Mainstream image captioning models are usually two-stage captioners: a pre-trained detector computes object features, which are then fed into a language model to generate text descriptions. However, this design introduces a task-based information gap that degrades performance, because object features learned for the detection task are a suboptimal representation and cannot provide all the information needed for subsequent text generation. In addition, object features are usually taken from the last layer, which loses the local details of the input image. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms an input image into descriptive sentences in one stage. As a result, the task-based information gap can be greatly reduced. To obtain rich features, we use the Swin Transformer to compute multi-level features and feed them into a novel dynamic multi-sight embedding module that exploits both the global structure and the local texture of input images. To enhance the global modeling of the encoder for captioning, we propose a new dual-dimensional refining module that non-locally models the interaction of the embedded features. OSIC can thus obtain rich and useful information to improve the image captioning task. Extensive comparisons on the benchmark MS-COCO dataset verify the superior performance of our method.
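
The abstract does not say how the multi-level features are combined, so the following is only a rough sketch of one plausible reading: pool each feature level to a single vector, score each level with a gating vector, and take a softmax-weighted sum so that the mix of fine and coarse levels depends on the input. The function names (`mean_pool`, `fuse_levels`), the gating vector `gate_weights`, and the assumption that all levels already share one channel dimension are hypothetical illustrations, not the authors' dynamic multi-sight embedding module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / n for i in range(dim)]


def fuse_levels(level_tokens, gate_weights):
    """Fuse per-level features with input-dependent weights.

    level_tokens: one entry per backbone level, each a list of token
        vectors (assumed here to share the same channel dimension).
    gate_weights: hypothetical gating vector used to score each level;
        in a real model this would be a learned parameter.
    Returns the fused vector and the per-level weights.
    """
    pooled = [mean_pool(tokens) for tokens in level_tokens]
    # One scalar score per level: dot product of gate and pooled feature.
    scores = [sum(w * x for w, x in zip(gate_weights, vec)) for vec in pooled]
    weights = softmax(scores)
    dim = len(gate_weights)
    fused = [
        sum(weights[k] * pooled[k][i] for k in range(len(pooled)))
        for i in range(dim)
    ]
    return fused, weights
```

With a fine level of two tokens `[[1.0, 0.0], [3.0, 0.0]]`, a coarse level of one token `[[0.0, 2.0]]`, and `gate_weights = [1.0, 1.0]`, both levels receive the same score, so the weights are equal and the fused vector is the plain average of the two pooled vectors. A real implementation would operate on tensors and learn the gate during training.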

updated: Fri Nov 04 2022 08:50:09 GMT+0000 (UTC)

published: Fri Nov 04 2022 08:50:09 GMT+0000 (UTC)

arXiv
