Structured Sparse R-CNN for Direct Scene Graph Generation

Yao Teng; Limin Wang

Scene graph generation (SGG) aims to detect entity pairs and their relations in an image. Existing SGG approaches often use multi-stage pipelines that decompose the task into object detection, relation graph construction, and dense or dense-to-sparse relation prediction. Instead, treating SGG as a direct set prediction problem, this paper presents a simple, sparse, and unified framework for relation detection, termed Structured Sparse R-CNN. The key to our method is a set of learnable triplet queries and a structured triplet detector, which can be jointly optimized on the training set in an end-to-end manner. Specifically, the triplet queries encode the general prior for entity pair locations, categories, and their relations, and provide an initial guess of the relation detections for subsequent refinement. The triplet detector uses a cascaded dynamic head design to progressively refine the relation detection results. In addition, to ease the training difficulty of Structured Sparse R-CNN, we propose a relaxed and enhanced training strategy based on knowledge distillation from a Siamese Sparse R-CNN. We also propose an adaptive focusing parameter and an average logit approach to handle the imbalanced data distribution. We perform experiments on two benchmarks, Visual Genome and Open Images, and the results demonstrate that our method achieves state-of-the-art performance. We also perform in-depth ablation studies to provide insights into the structured modeling in our triplet detector design and training strategies.
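
The abstract does not say how the focusing parameter is adapted. As a minimal sketch, assuming the term refers to the exponent gamma of the standard focal loss, the snippet below shows how a larger gamma shifts the loss toward hard (low-confidence) samples, which is the usual motivation for using it on imbalanced data. The function name `focal_loss` and its arguments are illustrative and not taken from the paper, and gamma is fixed here rather than adapted.

```python
import math


def focal_loss(p_true: float, gamma: float) -> float:
    """Standard focal loss for one sample: -(1 - p)^gamma * log(p).

    p_true is the predicted probability of the sample's true class.
    gamma is the focusing parameter (assumption: fixed, not adaptive).
    """
    eps = 1e-12
    p = min(max(p_true, eps), 1.0)  # clamp to avoid log(0)
    return -((1.0 - p) ** gamma) * math.log(p)


# Compare an easy sample (p = 0.9) with a hard one (p = 0.1).
# gamma = 0 reduces to plain cross-entropy; a larger gamma
# down-weights the easy sample much more than the hard one.
for gamma in (0.0, 2.0):
    easy = focal_loss(0.9, gamma)
    hard = focal_loss(0.1, gamma)
    print(f"gamma={gamma}: easy={easy:.5f} hard={hard:.5f}")
```

An adaptive scheme would choose gamma per class or per sample instead of using one global value; the rule for doing so is described in the paper itself and is not reproduced here.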

updated: Mon Jun 21 2021 02:24:20 GMT+0000 (UTC)

published: Mon Jun 21 2021 02:24:20 GMT+0000 (UTC)

arXiv
