Self-Annotated Training for Controllable Image Captioning

Zhangzi Zhu; Tianlei Wang; Hong Qu

The Controllable Image Captioning (CIC) task aims to generate captions conditioned on designated control signals. In this paper, we improve CIC in two respects. 1) Existing reinforcement training methods are not applicable to structure-related CIC models, because the accuracy-based reward focuses mainly on content rather than on semantic structure. Without reinforcement training, the model cannot generate more accurate and controllable sentences. To solve this problem, we propose a novel reinforcement training method for structure-related CIC models: Self-Annotated Training (SAT), in which a recursive sampling mechanism (RSM) is designed to force the input control signal to match the actual output sentence. Extensive experiments on MSCOCO show that SAT improves the CIDEr-D score of C-Transformer (XE) from 118.6 to 130.1 in the length-control task and from 132.2 to 142.7 in the tense-control task, while maintaining more than 99% matching accuracy with the control signal. 2) We introduce a new control signal: sentence quality. Equipped with it, CIC models can generate captions of different quality levels as needed. Experiments show that, without additional information from ground-truth captions, models controlled by the highest level of sentence quality achieve much higher accuracy than baseline models.
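The abstract states only that the RSM forces the input control signal to match the actual output sentence. One plausible reading, sketched below for the length-control case, is to sample a caption, annotate it with the signal it actually satisfies, and resample with that annotated signal until the two agree. Everything in the sketch is an assumption made for illustration and not the authors' implementation: the bucket boundaries in `length_level`, the helper names `recursive_sample` and `make_toy_sampler`, the `max_rounds` cap, and the toy sampler standing in for a captioning model.

```python
import random
from typing import Callable, List, Tuple

Caption = List[str]


def length_level(caption: Caption) -> int:
    """Annotate a caption with the length level it actually falls into.

    The bucket boundaries are hypothetical; the abstract does not define them.
    """
    n = len(caption)
    if n <= 9:
        return 0
    if n <= 14:
        return 1
    return 2


def recursive_sample(
    sample_fn: Callable[[int], Caption],
    annotate_fn: Callable[[Caption], int],
    signal: int,
    max_rounds: int = 5,
) -> Tuple[int, Caption, bool]:
    """Resample until the input signal equals the signal annotated on the output.

    Returns (signal, caption, matched). `max_rounds` is an assumed safety cap.
    """
    caption = sample_fn(signal)
    for _ in range(max_rounds):
        actual = annotate_fn(caption)
        if actual == signal:
            return signal, caption, True
        # Self-annotation: the sampled sentence supplies the next input signal.
        signal = actual
        caption = sample_fn(signal)
    return signal, caption, annotate_fn(caption) == signal


def make_toy_sampler(seed: int) -> Callable[[int], Caption]:
    """Stand-in for a captioning model: noisy caption lengths around a target."""
    rng = random.Random(seed)
    targets = {0: 8, 1: 12, 2: 17}

    def sample(signal: int) -> Caption:
        n = max(1, targets[signal] + rng.randint(-3, 3))
        return ["w"] * n

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler(seed=0)
    signal, caption, matched = recursive_sample(sampler, length_level, signal=0)
    print(signal, len(caption), matched)
```

Under this reading, the matched (signal, caption) pair would then be scored by the usual accuracy-based reward, so that the reward is computed only on samples whose control signal is consistent with the sentence. How the reward and the policy update are actually defined is not given in the abstract and is left out of the sketch.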

updated: Sat Oct 16 2021 02:10:23 GMT+0000 (UTC)

published: Sat Oct 16 2021 02:10:23 GMT+0000 (UTC)

arXiv
