Scenes and Surroundings: Scene Graph Generation using Relation Transformer

Rajat Koner; Poulami Sinhamahapatra; Volker Tresp

Identifying the objects in an image and their mutual relationships as a scene graph leads to a deep understanding of image content. Despite recent advances in deep learning, detecting and labeling visual object relationships remains a challenging task. This work proposes a novel local-context-aware architecture, the Relation Transformer, which exploits complex global object-to-object and object-to-edge (relation) interactions. Our hierarchical multi-head attention-based approach efficiently captures contextual dependencies between objects and predicts their relationships. Compared with state-of-the-art approaches, we achieve an overall mean improvement of 4.85% and set a new benchmark across all scene graph generation tasks on the Visual Genome dataset.
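
For readers unfamiliar with the term, a scene graph can be pictured as a set of detected objects (nodes) joined by directed, labeled relations (edges). The following is a minimal illustrative sketch of that data structure only, not the paper's model or code; the names `SceneObject`, `SceneGraph`, and `build_example`, the bounding-box format, and the example labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    """A detected object: class label plus a bounding box (x1, y1, x2, y2)."""
    label: str
    box: tuple


@dataclass
class SceneGraph:
    """Objects are nodes; relations are directed, labeled edges between them."""
    objects: list = field(default_factory=list)
    # Each relation is stored as (subject_index, predicate, object_index).
    relations: list = field(default_factory=list)

    def add_object(self, label, box):
        """Add a node and return its index for use in later relations."""
        self.objects.append(SceneObject(label, box))
        return len(self.objects) - 1

    def add_relation(self, subj, predicate, obj):
        """Add a directed edge subj -> obj labeled with the predicate."""
        n = len(self.objects)
        if not (0 <= subj < n and 0 <= obj < n):
            raise IndexError("relation refers to an unknown object")
        self.relations.append((subj, predicate, obj))

    def triples(self):
        """Return relations as readable (subject, predicate, object) labels."""
        return [
            (self.objects[s].label, p, self.objects[o].label)
            for s, p, o in self.relations
        ]


def build_example():
    """Hypothetical hand-built graph for an image of a rider on a horse."""
    graph = SceneGraph()
    man = graph.add_object("man", (10, 20, 110, 300))
    horse = graph.add_object("horse", (60, 120, 320, 380))
    hat = graph.add_object("hat", (30, 5, 80, 40))
    graph.add_relation(man, "riding", horse)
    graph.add_relation(man, "wearing", hat)
    return graph


if __name__ == "__main__":
    for triple in build_example().triples():
        print(triple)
```

In scene graph generation, both the nodes and the edge labels are predicted from the image rather than entered by hand as in this sketch.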

updated: Mon Jul 12 2021 14:22:20 GMT+0000 (UTC)

published: Mon Jul 12 2021 14:22:20 GMT+0000 (UTC)

arXiv
