Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning

Dong-Jin Kim; Jinsoo Choi; Tae-Hyun Oh; In So Kweon

Our goal in this work is to train an image captioning model that generates denser and more informative captions. We introduce "relational captioning," a novel image captioning task that aims to generate multiple captions, each describing relational information between objects in an image. Relational captioning is advantageous in both the diversity and the amount of information it provides, leading to image understanding based on relationships. Part-of-speech (POS, i.e., subject-object-predicate categories) tags can be assigned to every English word. We leverage the POS as a prior to guide the correct sequence of words in a caption. To this end, we propose a multi-task triple-stream network (MTTSNet), which consists of three recurrent units, one for each POS, and jointly performs POS prediction and captioning. We demonstrate that the proposed model generates more diverse and richer representations than several baselines and competing methods.
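The triple-stream idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract only states that there are three recurrent units for the respective POS and that POS prediction and captioning are performed jointly. Everything else here is an assumption made for illustration, including the simple tanh cells, the fusion of the three hidden states by concatenation, the per-stream inputs, the class and function names (`RecurrentUnit`, `TripleStreamSketch`, `joint_loss`), and the `pos_weight` parameter.

```python
import math
import random

random.seed(0)

# The three POS categories named in the abstract.
POS_CLASSES = ("subject", "predicate", "object")


def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class RecurrentUnit:
    """Assumed plain tanh cell: h' = tanh(W_x x + W_h h)."""

    def __init__(self, input_size, hidden_size):
        self.w_x = rand_matrix(hidden_size, input_size)
        self.w_h = rand_matrix(hidden_size, hidden_size)

    def step(self, x, h):
        a = matvec(self.w_x, x)
        b = matvec(self.w_h, h)
        return [math.tanh(p + q) for p, q in zip(a, b)]


class TripleStreamSketch:
    """One recurrent unit per POS class, with two output heads (assumed layout)."""

    def __init__(self, input_size, hidden_size, vocab_size):
        self.hidden_size = hidden_size
        self.streams = {pos: RecurrentUnit(input_size, hidden_size) for pos in POS_CLASSES}
        # Both heads read the concatenation of the three hidden states (assumption).
        self.word_head = rand_matrix(vocab_size, 3 * hidden_size)
        self.pos_head = rand_matrix(len(POS_CLASSES), 3 * hidden_size)

    def init_state(self):
        return {pos: [0.0] * self.hidden_size for pos in POS_CLASSES}

    def step(self, inputs, state):
        """Advance all three streams one time step and predict word and POS."""
        new_state = {pos: self.streams[pos].step(inputs[pos], state[pos]) for pos in POS_CLASSES}
        fused = [v for pos in POS_CLASSES for v in new_state[pos]]
        word_probs = softmax(matvec(self.word_head, fused))
        pos_probs = softmax(matvec(self.pos_head, fused))
        return word_probs, pos_probs, new_state


def joint_loss(word_probs, pos_probs, word_id, pos_id, pos_weight=1.0):
    """Multi-task loss: word cross-entropy plus weighted POS cross-entropy."""
    return -math.log(word_probs[word_id]) - pos_weight * math.log(pos_probs[pos_id])


# Usage with made-up sizes and random features for a single time step.
model = TripleStreamSketch(input_size=4, hidden_size=5, vocab_size=7)
inputs = {pos: [random.random() for _ in range(4)] for pos in POS_CLASSES}
word_probs, pos_probs, state = model.step(inputs, model.init_state())
loss = joint_loss(word_probs, pos_probs, word_id=2, pos_id=0)
```

The sketch only covers a forward step and the combined loss. Training, real image features, and decoding a full caption are left out, since the abstract does not specify them.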

updated: Sun Sep 22 2019 07:24:51 GMT+0000 (UTC)

published: Thu Mar 14 2019 12:36:01 GMT+0000 (UTC)

arXiv
