RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition

Jun Chen; Aniket Agarwal; Sherif Abdelkarim; Deyao Zhu; Mohamed Elhoseiny

The visual relationship recognition (VRR) task aims at understanding the pairwise visual relationships between interacting objects in an image. These relationships typically have a long-tail distribution due to their compositional nature. This problem gets more severe when the vocabulary becomes large, rendering this task very challenging. This paper shows that modeling an effective message-passing flow through an attention mechanism can be critical to tackling the compositionality and long-tail challenges in VRR. The method, called RelTransformer, represents each image as a fully-connected scene graph and restructures the whole scene into the relation-triplet and global-scene contexts. It directly passes the message from each element in the relation-triplet and global-scene contexts to the target relation via self-attention. We also design a learnable memory to augment the long-tail relation representation learning. Through extensive experiments, we find that our model generalizes well on many VRR benchmarks. Our model outperforms the best-performing models on two large-scale long-tail VRR benchmarks, VG8K-LT (+2.0% overall acc) and GQA-LT (+26.0% overall acc), both having a highly skewed distribution towards the tail. It also achieves strong results on the VG200 relation detection task. Our code is available at https://github.com/Vision-CAIR/RelTransformer.
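
To make the message-passing idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes single-head scaled dot-product attention with identity projections, a residual update, and a memory read applied after the context read; the function names (`attend`, `relation_update`), the vector sizes, and the random inputs are all hypothetical. The released code at the URL above is the authoritative reference.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the weighted sum of `values` and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return mixed, weights


def relation_update(relation, triplet_ctx, global_ctx, memory):
    """Hypothetical update of a target relation vector.

    Step 1: the relation attends directly to every element of the
            relation-triplet and global-scene contexts.
    Step 2: the result reads from a memory (learnable in a real model,
            fixed here). The ordering of the steps is an assumption.
    """
    context = triplet_ctx + global_ctx
    message, ctx_weights = attend(relation, context, context)
    relation = [r + m for r, m in zip(relation, message)]  # residual add

    recalled, mem_weights = attend(relation, memory, memory)
    relation = [r + m for r, m in zip(relation, recalled)]  # residual add
    return relation, ctx_weights, mem_weights


# Toy usage with random vectors standing in for visual features.
rng = random.Random(0)
DIM = 8


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


subject, obj, relation = rand_vec(), rand_vec(), rand_vec()
global_ctx = [rand_vec() for _ in range(5)]  # other objects in the scene
memory = [rand_vec() for _ in range(4)]      # stand-in for learned memory slots

updated, ctx_w, mem_w = relation_update(relation, [subject, obj], global_ctx, memory)
```

In this sketch every context element sends its message to the target relation in a single attention step, rather than through multiple hops along graph edges, which is one reading of "directly passes the message" in the abstract.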

updated: Tue Mar 29 2022 14:47:44 GMT+0000 (UTC)

published: Sat Apr 24 2021 12:04:04 GMT+0000 (UTC)

arXiv
