Similar Scenes arouse Similar Emotions: Parallel Data Augmentation for Stylized Image Captioning

Guodun Li; Yuchen Zhai; Zehao Lin; Yin Zhang

Stylized image captioning systems aim to generate captions that are not only semantically related to a given image but also consistent with a given style description. One of the biggest challenges of this task is the lack of sufficient paired stylized data. Many studies focus on unsupervised approaches without considering the problem from the perspective of data augmentation. We begin with the observation that people may recall similar emotions when they are in similar scenes, and often express similar emotions with similar style phrases; this observation underpins our data augmentation idea. In this paper, we propose a novel Extract-Retrieve-Generate data augmentation framework that extracts style phrases from small-scale stylized sentences and grafts them onto large-scale factual captions. First, we design an emotional signal extractor to extract style phrases from small-scale stylized sentences. Second, we construct a pluggable multi-modal scene retriever to retrieve scenes, each represented as a pair of an image and its stylized caption, that are similar to the query image or caption in the large-scale factual data. Finally, based on the style phrases of similar scenes and the factual description of the current scene, we build an emotion-aware caption generator to generate fluent and diversified stylized captions for the current scene. Extensive experimental results show that our framework alleviates the data scarcity problem effectively. It also significantly boosts the performance of several existing image captioning models in both supervised and unsupervised settings, outperforming state-of-the-art stylized image captioning methods by a substantial margin in terms of both sentence relevance and stylishness.
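The Extract-Retrieve-Generate flow described in the abstract can be illustrated with a toy, text-only sketch. This is not the authors' implementation: the paper's extractor, retriever, and generator are learned models operating on images and captions, whereas the stand-ins below are simple heuristics chosen only to show how data moves through the three stages. All names (`StylizedScene`, `extract_style_phrase`, `retrieve_similar`, `generate`, `augment`) are hypothetical, and the sketch assumes each small-scale stylized example also carries a plain description standing in for its image.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StylizedScene:
    """One small-scale example (hypothetical structure, not from the paper)."""
    factual: str   # plain description, standing in for the image
    stylized: str  # the same scene described with a style phrase


def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def extract_style_phrase(scene: StylizedScene) -> str:
    """Extract (toy): keep stylized words absent from the factual description."""
    factual_words = tokens(scene.factual)
    kept = [w for w in scene.stylized.lower().split() if w not in factual_words]
    return " ".join(kept)


def retrieve_similar(query: str, scenes: list[StylizedScene]) -> StylizedScene:
    """Retrieve (toy): choose the scene with the highest Jaccard token overlap."""
    q = tokens(query)

    def jaccard(scene: StylizedScene) -> float:
        s = tokens(scene.factual)
        union = q | s
        return len(q & s) / len(union) if union else 0.0

    return max(scenes, key=jaccard)


def generate(query: str, style_phrase: str) -> str:
    """Generate (toy): graft the style phrase onto the factual caption."""
    return f"{query} {style_phrase}".strip()


def augment(factual_captions: list[str], scenes: list[StylizedScene]) -> list[str]:
    """Produce one pseudo-stylized caption per large-scale factual caption."""
    return [
        generate(caption, extract_style_phrase(retrieve_similar(caption, scenes)))
        for caption in factual_captions
    ]


if __name__ == "__main__":
    small_stylized = [
        StylizedScene("a dog runs on the beach",
                      "a dog runs on the beach chasing his dreams"),
        StylizedScene("a man rides a bike on the street",
                      "a man rides a bike on the street feeling free"),
    ]
    large_factual = ["a puppy runs along the beach",
                     "a woman rides a bike in the park"]
    for caption in augment(large_factual, small_stylized):
        print(caption)
```

In this sketch, token-set difference and Jaccard overlap are deliberately crude substitutes; a real system would replace each function with a trained component while keeping the same extract, retrieve, generate ordering.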

updated: Thu Aug 26 2021 17:08:58 GMT+0000 (UTC)

published: Thu Aug 26 2021 17:08:58 GMT+0000 (UTC)

arXiv
