Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning

Yang Yang; Hongchen Wei; Hengshu Zhu; Dianhai Yu; Hui Xiong; Qingshan Liu; Jian Yang


Image captioning aims to generate captions directly from images via an automatically learned cross-modal generator. To build a well-performing generator, existing approaches usually need a large number of described images, which requires a huge manual labeling effort. However, in real-world applications, a more general scenario is that we have only a limited number of described images and a large number of undescribed images. The resulting challenge is how to effectively incorporate the undescribed images into the learning of the cross-modal generator. To solve this problem, we propose a novel image captioning method that exploits Cross-modal Prediction and Relation Consistency (CPRC), which aims to utilize the raw image input to constrain the generated sentence in a common semantic space. In detail, considering that the heterogeneous gap between modalities makes it difficult to supervise with the global embedding directly, CPRC transforms both the raw image and the corresponding generated sentence into a shared semantic space and measures the generated sentence from two aspects: 1) Prediction consistency. CPRC uses the prediction of the raw image as a soft label to distill useful supervision for the generated sentence, rather than employing traditional pseudo labeling; 2) Relation consistency. CPRC develops a novel relation consistency between augmented images and the corresponding generated sentences to retain important relational knowledge. As a result, CPRC supervises the generated sentence from both the informativeness and representativeness perspectives, and can reasonably use the undescribed images to learn a more effective generator under the semi-supervised scenario.
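The two consistency terms can be illustrated with a minimal NumPy sketch. This is not the formulation from the paper: the abstract does not give the loss functions, so the choices below (a temperature-softened KL divergence for prediction consistency, cosine-similarity relation matrices compared by mean squared error for relation consistency, and the equal weighting in the final sum) are all assumptions made for illustration. The function names and the random inputs are hypothetical stand-ins for the outputs of an image branch and a sentence branch that map into a shared semantic space.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over the class axis.
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def prediction_consistency(image_logits, sentence_logits, temperature=2.0):
    # Assumption: the image prediction acts as a soft label and the
    # sentence prediction is pulled toward it with KL(p_image || p_sentence).
    p = softmax(image_logits, temperature)
    q = softmax(sentence_logits, temperature)
    eps = 1e-12
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(np.mean(kl))

def relation_matrix(embeddings):
    # Pairwise cosine similarity between all samples in the batch.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12
    normed = embeddings / norms
    return normed @ normed.T

def relation_consistency(image_embeddings, sentence_embeddings):
    # Assumption: the relational structure among (augmented) images should
    # match the structure among their generated sentences; measured by MSE.
    diff = relation_matrix(image_embeddings) - relation_matrix(sentence_embeddings)
    return float(np.mean(diff ** 2))

# Hypothetical batch of 4 undescribed images, 5 semantic classes, 8-dim embeddings.
rng = np.random.default_rng(0)
image_logits = rng.normal(size=(4, 5))
sentence_logits = rng.normal(size=(4, 5))
image_embeddings = rng.normal(size=(4, 8))
sentence_embeddings = rng.normal(size=(4, 8))

# Equal weighting of the two terms is an arbitrary illustrative choice.
unsupervised_loss = (
    prediction_consistency(image_logits, sentence_logits)
    + relation_consistency(image_embeddings, sentence_embeddings)
)
```

In this sketch the soft label keeps the full predicted distribution of the image rather than a single hard class, which is the contrast with pseudo labeling that the abstract draws. The relation term depends only on pairwise similarities within a batch, so it is unaffected by rescaling the embeddings of either modality.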

updated: Fri Oct 22 2021 13:14:32 GMT+0000 (UTC)

published: Fri Oct 22 2021 13:14:32 GMT+0000 (UTC)

arXiv

