Dense Relational Image Captioning via Multi-task Triple-Stream Networks

Dong-Jin Kim; Tae-Hyun Oh; Jinsoo Choi; In So Kweon

We introduce dense relational captioning, a novel image captioning task that aims to generate multiple captions describing the relational information between objects in a visual scene. Relational captioning provides an explicit description of each relationship between object combinations. This framework is advantageous in both the diversity and the amount of information it yields, leading to comprehensive, relationship-based image understanding, e.g., relational proposal generation. For relational understanding between objects, the part-of-speech (POS; i.e., subject-object-predicate categories) can serve as valuable prior information to guide the causal sequence of words in a caption. We train our framework not only to generate captions but also to understand the POS of each word. To this end, we propose the multi-task triple-stream network (MTTSNet), which consists of three recurrent units, one responsible for each POS, and is trained by jointly predicting the correct caption and the POS of each word. In addition, we find that the performance of MTTSNet can be improved by modulating the object embeddings with an explicit relational module. Through extensive experimental analysis on large-scale datasets and several metrics, we demonstrate that the proposed model generates more diverse and richer captions. Finally, we present applications of our framework to holistic image captioning, scene graph generation, and retrieval tasks.
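
The abstract names two ideas that a small sketch can make concrete: captions are produced per relationship between object combinations, and training jointly predicts each word and its POS. The Python below is only an illustration under stated assumptions, not the authors' implementation: treating combinations as ordered (subject, object) pairs, the averaged cross-entropy form of the loss, and the names `ordered_pairs`, `joint_loss`, and `pos_weight` are all hypothetical and do not come from the paper.

```python
import math
from itertools import permutations

# The three POS categories named in the abstract.
POS_CLASSES = ("subject", "predicate", "object")


def ordered_pairs(object_ids):
    """Enumerate ordered (subject, object) pairs; N objects give N*(N-1) pairs.

    Assumption: each ordered pair is one candidate for a relational caption.
    """
    return list(permutations(object_ids, 2))


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])


def joint_loss(word_probs, word_targets, pos_probs, pos_targets, pos_weight=1.0):
    """Average per-word caption loss plus a weighted per-word POS loss.

    pos_weight is a hypothetical balancing factor, not a value from the paper.
    """
    n = len(word_targets)
    caption = sum(cross_entropy(p, t) for p, t in zip(word_probs, word_targets)) / n
    pos = sum(cross_entropy(p, t) for p, t in zip(pos_probs, pos_targets)) / n
    return caption + pos_weight * pos
```

In this sketch, each predicted word comes with a second distribution over `POS_CLASSES`, so one caption contributes to both terms of the loss. How the two terms are actually balanced in MTTSNet is not stated in the abstract.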

updated: Mon Oct 11 2021 08:49:57 GMT+0000 (UTC)

published: Thu Oct 08 2020 09:17:55 GMT+0000 (UTC)

arXiv
