LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation

Xiaoguang Chang; Teng Wang; Shaowei Cai; Changyin Sun

Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the long-tail problem of datasets. Recently, various unbiased strategies have been proposed that design novel loss functions and data balancing strategies. Unfortunately, these unbiased methods fail to emphasize language priors from a feature refinement perspective. Inspired by the fact that predicates are highly correlated with the semantics hidden in the subject-object pair and the global context, we propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK), which learns predicate-relevant representations from language-vision interactive patterns, global language context, and pair-predicate correlation. Specifically, we first project object labels to three distinctive semantic embeddings for different representation learning. Then, the Language Attention Module (LAM) and the Experience Estimation Module (EEM) process subject-object word embeddings into an attention vector and a predicate distribution, respectively. The Language Context Module (LCM) encodes global context from each word embedding, which avoids isolated learning from local information. Finally, the module outputs are used to update the visual representations and the SGG model's prediction. All language representations are generated purely from object categories, so no extra knowledge is needed. The framework is model-agnostic and consistently improves the performance of existing SGG models. Moreover, because its unbiased strategy operates at the representation level, LANDMARK is compatible with other methods. Code is available at https://github.com/rafa-cxg/PySGG-cxg.
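
The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that): the layer shapes, the sigmoid gate, the mean-pooled context, and the log-prior fusion are all assumptions made here for illustration, and every name (`landmark_sketch`, `w_lam`, `w_eem`, `w_cls`, the toy sizes) is hypothetical. It only shows three label embeddings feeding an attention vector (LAM), a predicate distribution (EEM), and a global context (LCM), which then update the visual feature and the prediction.

```python
import math
import random

random.seed(0)

NUM_OBJ, NUM_PRED, DIM = 5, 4, 8  # toy sizes, chosen for illustration only


def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-x)) for x in z]


# Three separate embedding tables, one per module (rows indexed by object label).
emb_lam = rand_matrix(NUM_OBJ, DIM)
emb_eem = rand_matrix(NUM_OBJ, DIM)
emb_lcm = rand_matrix(NUM_OBJ, DIM)

# Hypothetical linear layers (assumption): the real modules are likely richer.
w_lam = rand_matrix(DIM, 2 * DIM)       # pair embedding -> attention vector
w_eem = rand_matrix(NUM_PRED, 2 * DIM)  # pair embedding -> predicate logits
w_cls = rand_matrix(NUM_PRED, 2 * DIM)  # [visual, context] -> predicate logits


def landmark_sketch(labels, subj, obj, visual_feat):
    """Return a predicate distribution for the pair (labels[subj], labels[obj])."""
    s, o = labels[subj], labels[obj]
    # LAM (assumed form): gate the visual feature with a language-derived vector.
    attn = sigmoid(matvec(w_lam, emb_lam[s] + emb_lam[o]))  # list concat = pair
    attended = [a * v for a, v in zip(attn, visual_feat)]
    # EEM (assumed form): predicate distribution from the label pair alone.
    prior = softmax(matvec(w_eem, emb_eem[s] + emb_eem[o]))
    # LCM (assumed form): mean over the embeddings of all labels in the image.
    context = [sum(emb_lcm[l][d] for l in labels) / len(labels) for d in range(DIM)]
    # Update the prediction: classifier logits plus the log of the EEM prior.
    logits = matvec(w_cls, attended + context)
    return softmax([z + math.log(p) for z, p in zip(logits, prior)])


probs = landmark_sketch(labels=[0, 3, 1], subj=0, obj=1, visual_feat=[1.0] * DIM)
```

Because every language-side input here is derived from object labels only, the sketch needs no external knowledge source, which mirrors the claim in the abstract; how the actual modules are parameterized and fused is left to the paper and code.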

updated: Thu Mar 02 2023 09:03:11 GMT+0000 (UTC)

published: Thu Mar 02 2023 09:03:11 GMT+0000 (UTC)

arXiv

