Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models

Khyathi Raghavi Chandu; Piyush Sharma; Soravit Changpinyo; Ashish Thapliyal; Radu Soricut

Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples, gathered from the wild, often from noisy alt-text data. However, recent modeling approaches to IC often fall short in terms of performance in this case, because they assume a clean annotated dataset (as opposed to the noisier alt-text-based annotations) and employ an end-to-end generation approach, which often lacks both controllability and interpretability. We address these problems by breaking down the task into two simpler, more controllable tasks: skeleton prediction and skeleton-based caption generation. Specifically, we show that selecting content words as skeletons helps in generating improved and denoised captions when leveraging rich yet noisy alt-text-based uncurated datasets. We also show that the predicted English skeletons can be further cross-lingually leveraged to generate non-English captions, and present experimental results covering caption generation in French, Italian, German, Spanish and Hindi. We also show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression, providing a handle to perform human-in-the-loop semi-automatic corrections.
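
To make the notion of a content-word skeleton concrete, the following is a minimal sketch, not the authors' method. It assumes a skeleton can be approximated by dropping function words from a noisy alt-text string. The stop-word list and the names `FUNCTION_WORDS`, `extract_skeleton`, and `truncate_skeleton` are hypothetical and introduced only for illustration. The abstract does not specify how content words are selected, and in the paper skeletons are predicted by a model rather than filtered from text.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for this
# sketch); a real system would more likely rely on a part-of-speech tagger.
FUNCTION_WORDS = frozenset({
    "a", "an", "the", "of", "in", "on", "at", "by", "for", "with", "to",
    "from", "and", "or", "is", "are", "was", "were", "this", "that",
})


def extract_skeleton(text, function_words=FUNCTION_WORDS):
    """Return the content words of `text`, in order, without duplicates."""
    # Lowercase and keep alphanumeric tokens (with an optional apostrophe suffix).
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    skeleton = []
    for token in tokens:
        if token not in function_words and token not in skeleton:
            skeleton.append(token)
    return skeleton


def truncate_skeleton(skeleton, max_words):
    """Illustrate length control: keep only the first `max_words` content words."""
    return skeleton[:max_words]


alt_text = "A photo of the dog running on the beach at sunset"
skeleton = extract_skeleton(alt_text)
print(skeleton)
print(truncate_skeleton(skeleton, 2))
```

Editing or truncating such a word list before a caption is generated from it is one way to picture the control over length and content that the abstract describes.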

updated: Fri Apr 16 2021 23:11:48 GMT+0000 (UTC)

published: Thu Sep 10 2020 23:31:38 GMT+0000 (UTC)

arXiv
