From General to Specific: Informative Scene Graph Generation via Balance Adjustment

Yuyu Guo; Lianli Gao; Xuanhan Wang; Yuxuan Hu; Xing Xu; Xu Lu; Heng Tao Shen; Jingkuan Song

The scene graph generation (SGG) task aims to detect visual relationship triplets, i.e., subject, predicate, object, in an image, providing a structural vision layout for scene understanding. However, current models are stuck on common predicates, e.g., "on" and "at", rather than informative ones, e.g., "standing on" and "looking at", which costs both precise information and overall performance. If a model describes an image only as "stone on road" rather than using "blocking", the scene is easily misunderstood. We argue that this phenomenon is caused by two key imbalances between informative predicates and common ones: semantic space level imbalance and training sample level imbalance. To tackle this problem, we propose BA-SGG, a simple yet effective SGG framework based on balance adjustment rather than conventional distribution fitting. It integrates two components, Semantic Adjustment (SA) and Balanced Predicate Learning (BPL), which adjust these two imbalances respectively. Because the process is model-agnostic, our method is easily applied to state-of-the-art SGG models and significantly improves their performance. Our method achieves 14.3%, 8.0%, and 6.1% higher Mean Recall (mR) than the Transformer model on the three scene graph generation sub-tasks on Visual Genome, respectively. Code is publicly available.
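The reason Mean Recall (mR) is the reported metric can be illustrated with a minimal sketch. It is not the authors' evaluation code: real SGG evaluation computes Recall@K over ranked predictions per image, while this toy version only assumes a flat list of ground-truth predicates with a hit/miss flag for each. The function name `recall_and_mean_recall` and the toy data are hypothetical.

```python
from collections import defaultdict


def recall_and_mean_recall(gt_predicates, hit_flags):
    """Return (overall recall, mean recall, per-predicate recall).

    gt_predicates: predicate label of each ground-truth triplet.
    hit_flags: True where the model recovered that triplet.
    Simplified assumption: no top-K ranking, no per-image averaging.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for pred, hit in zip(gt_predicates, hit_flags):
        total[pred] += 1
        hits[pred] += int(hit)

    # Overall recall pools every triplet, so frequent predicates dominate.
    overall = sum(hits.values()) / sum(total.values())

    # Mean recall averages per-predicate recall, so each predicate counts once.
    per_class = {p: hits[p] / total[p] for p in total}
    mean_recall = sum(per_class.values()) / len(per_class)
    return overall, mean_recall, per_class


# Toy data (hypothetical): the common predicate "on" is always recovered,
# the informative predicate "standing on" never is.
gt = ["on"] * 8 + ["standing on"] * 2
hit = [True] * 8 + [False] * 2
overall, mean_recall, per_class = recall_and_mean_recall(gt, hit)
print(overall, mean_recall, per_class)
```

In this toy case a model that only ever predicts the common predicate still scores a high overall recall, while its mean recall is much lower because the informative predicate contributes a recall of zero.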

updated: Mon Aug 30 2021 11:39:43 GMT+0000 (UTC)

published: Mon Aug 30 2021 11:39:43 GMT+0000 (UTC)

arXiv
