ReFormer: The Relational Transformer for Image Captioning

Xuewen Yang; Yingru Liu; Xin Wang

Image captioning has been shown to perform better when scene graphs are used to represent the relations between objects in the image. Current captioning encoders generally use a Graph Convolutional Net (GCN) to represent the relation information and merge it with the object region features via concatenation or convolution to produce the final input for sentence decoding. However, the GCN-based encoders in existing methods are less effective for captioning for two reasons. First, using image captioning as the objective (i.e., Maximum Likelihood Estimation) rather than a relation-centric loss cannot fully exploit the potential of the encoder. Second, using a pre-trained model instead of the encoder itself to extract the relationships is inflexible and does not contribute to the explainability of the model. To improve the quality of image captioning, we propose a novel architecture, ReFormer, a RElational transFORMER that generates features with relation information embedded and explicitly expresses the pair-wise relationships between objects in the image. ReFormer combines the objective of scene graph generation with that of image captioning in one modified Transformer model. This design allows ReFormer to generate not only better image captions, with the benefit of extracting strong relational image features, but also scene graphs that explicitly describe the pair-wise relationships. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
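The idea of training one model on both a captioning objective and a relation-centric objective can be illustrated with a minimal sketch. The abstract does not say how the two objectives are weighted or scheduled, so the weighted sum below, the `relation_weight` parameter, and all function names are assumptions made for illustration only, not the authors' implementation.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under one distribution."""
    return -math.log(probs[target])


def mean_cross_entropy(prob_rows, targets):
    """Average negative log-likelihood over a list of predictions."""
    losses = [cross_entropy(p, t) for p, t in zip(prob_rows, targets)]
    return sum(losses) / len(losses)


def joint_loss(caption_probs, caption_targets,
               relation_probs, relation_targets,
               relation_weight=1.0):
    """Hypothetical combined objective (an assumption, not from the paper).

    caption_probs:   one distribution over the vocabulary per caption token
    relation_probs:  one distribution over predicates per object pair
    relation_weight: assumed scalar balancing the two terms
    """
    caption_loss = mean_cross_entropy(caption_probs, caption_targets)
    relation_loss = mean_cross_entropy(relation_probs, relation_targets)
    return caption_loss + relation_weight * relation_loss


# Toy usage: two caption tokens and one object pair, all uniform over 2 classes.
loss = joint_loss(
    caption_probs=[[0.5, 0.5], [0.5, 0.5]],
    caption_targets=[0, 1],
    relation_probs=[[0.5, 0.5]],
    relation_targets=[1],
)
```

In this toy setup, the captioning term is the usual maximum likelihood loss over caption tokens, and the relation term scores predicted predicates for object pairs, so the encoder receives a training signal about relationships directly rather than only through the caption.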

updated: Thu Jul 29 2021 17:03:36 GMT+0000 (UTC)

published: Thu Jul 29 2021 17:03:36 GMT+0000 (UTC)

arXiv
