One-shot Scene Graph Generation

Yuyu Guo; Jingkuan Song; Lianli Gao; Heng Tao Shen

As a structured representation of image content, the visual scene graph (visual relationship) acts as a bridge between computer vision and natural language processing. Existing models for the scene graph generation task notoriously require tens or hundreds of labeled samples. By contrast, human beings can learn visual relationships from a few examples, or even one. Inspired by this, we design a task named One-Shot Scene Graph Generation, where each relationship triplet (e.g., "dog-has-head") comes from only one labeled example. The key insight is that rather than learning from scratch, one can utilize rich prior knowledge. In this paper, we propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task. Specifically, the Relational Knowledge represents prior knowledge of relationships between entities extracted from the visual content, e.g., the visual relationships "standing in", "sitting in", and "lying in" may exist between "dog" and "yard", while the Commonsense Knowledge encodes "sense-making" knowledge like "dog can guard yard". By organizing these two kinds of knowledge in a graph structure, Graph Convolution Networks (GCNs) are used to extract knowledge-embedded semantic features of the entities. In addition, instead of extracting isolated visual features from each entity generated by Faster R-CNN, we utilize an Instance Relation Transformer encoder to fully explore their context information. On a constructed one-shot dataset, the experimental results show that our method outperforms existing state-of-the-art methods by a large margin. Ablation studies also verify the effectiveness of the Instance Relation Transformer encoder and the Multiple Structured Knowledge.
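
The step of organizing prior knowledge as a graph and running a GCN over it can be illustrated with a toy sketch. This is not the authors' implementation: the three-node graph, the one-hot input features, the weight values, and every function name below are assumptions made for illustration, and the propagation rule is the commonly used symmetric-normalized form, which the abstract does not specify.

```python
import math

def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(a_norm, features, weight):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(a_norm, features), weight)
    return [[max(v, 0.0) for v in row] for row in hidden]

# Hypothetical toy knowledge graph: nodes are entity classes, and an
# undirected edge means some prior relation links the two classes.
entities = ["dog", "yard", "head"]
edges = [("dog", "yard"), ("dog", "head")]

index = {name: i for i, name in enumerate(entities)}
adj = [[0.0] * len(entities) for _ in entities]
for src, dst in edges:
    adj[index[src]][index[dst]] = 1.0
    adj[index[dst]][index[src]] = 1.0

# Assumed inputs: one-hot node features and an arbitrary fixed weight matrix.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weight = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]

a_norm = normalize_adjacency(adj)
embedded = gcn_layer(a_norm, features, weight)  # one 2-d vector per entity
```

Each row of `embedded` mixes an entity's own features with those of its graph neighbors, which is the sense in which the resulting semantic features are "knowledge-embedded". A real system would learn `weight` and would presumably use separate graphs for relational and commonsense knowledge.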

updated: Tue Feb 22 2022 11:32:59 GMT+0000 (UTC)

published: Tue Feb 22 2022 11:32:59 GMT+0000 (UTC)

arXiv

